ONESPACE: Detecting cross-language clones by learning a common embedding space

Journal of Systems and Software (2024)

Abstract
Identifying cloned code fragments across different languages can enhance the productivity of software developers in several ways. However, clone detection is usually studied in the context of a single language and is less explored for code snippets spanning different languages. In this paper, we present ONESPACE, a new cross-language clone detection approach. ONESPACE projects different programming languages into the same embedding space using both code and API data. ONESPACE then leverages a Siamese network to infer the similarity of the embedded programs. We evaluate ONESPACE by detecting clones across three language pairs: Java-Python, Java-C++, and Java-C. We compare ONESPACE with two other state-of-the-art techniques, SuPLEArN and CLCDSA. In our evaluation, ONESPACE provided higher effectiveness than the state of the art. Our ablation study validated some of our intuitions in designing ONESPACE, particularly that using a single embedding space (as opposed to separate ones) provides higher effectiveness. Additionally, we designed a variant of ONESPACE that uses the Word Mover's Distance algorithm; it provides lower effectiveness but is much more efficient. We also found that ONESPACE provides higher effectiveness than the state of the art even for complex implementations, single-method implementations, varying ratios of positive to negative clones in training, varying amounts of training data, and additional programming languages.
Keywords
Clone detection, Siamese neural networks, Word vector, Embedding