Clone-Seeker: Effective Code Clone Search Using Annotations

Source code search plays an important role in software development, e.g. for exploratory development or opportunistic reuse of existing code from a code base. Often, exploration of different implementations with the same functionality is needed for tasks like automated software transplantation, software diversification, and software repair. Code clones, which are syntactically or semantically similar code fragments, are perfect candidates for such tasks. Searching for code clones involves a given search query to retrieve the relevant code fragments. We propose a novel approach called Clone-Seeker that focuses on utilizing clone class features in retrieving code clones. For this purpose, we generate metadata for each code clone in the form of a natural language document. The metadata includes a pre-processed list of identifiers from the code clones augmented with a list of keywords indicating the semantics of the code clone. This keyword list can be extracted from a manually annotated general description of the clone class, or automatically generated from the source code of the entire clone class. This approach helps developers to perform code clone search based on a search query written either as source code terms, or as natural language. In our quantitative evaluation, we show that (1) Clone-Seeker has a higher recall when searching for semantic code clones (i.e., Type-4) in BigCloneBench than the state-of-the-art; and (2) Clone-Seeker can accurately search for relevant code clones by applying natural language queries.

[1]  Brad A. Myers,et al.  An Exploratory Study of How Developers Seek, Relate, and Collect Relevant Information during Software Maintenance Tasks , 2006, IEEE Transactions on Software Engineering.

[2]  Lee Martie,et al.  Understanding the impact of support for iteration on code search , 2017, ESEC/SIGSOFT FSE.

[3]  Weiguo Fan,et al.  Effective information retrieval using genetic algorithms based matching functions adaptation , 2000, Proceedings of the 33rd Annual Hawaii International Conference on System Sciences.

[4]  Michael W. Godfrey,et al.  Software bertillonage: finding the provenance of an entity , 2011, MSR '11.

[5]  Cristina V. Lopes,et al.  SourcererCC: Scaling Code Clone Detection to Big-Code , 2015, 2016 IEEE/ACM 38th International Conference on Software Engineering (ICSE).

[6]  Sushil Krishna Bajracharya,et al.  Sourcerer: a search engine for open source code supporting structure-based search , 2006, OOPSLA '06.

[7]  Tao Xie,et al.  Improving software quality via code searching and mining , 2009, 2009 ICSE Workshop on Search-Driven Development-Users, Infrastructure, Tools and Evaluation.

[8]  Michael W. Godfrey,et al.  “Cloning considered harmful” considered harmful: patterns of cloning in software , 2008, Empirical Software Engineering.

[9]  Giuseppe Bianco,et al.  Toxic Code Snippets on Stack Overflow , 2018, IEEE Transactions on Software Engineering.

[10]  Xiaodong Gu,et al.  Deep Code Search , 2018, 2018 IEEE/ACM 40th International Conference on Software Engineering (ICSE).

[11]  Jeffrey Dean,et al.  Efficient Estimation of Word Representations in Vector Space , 2013, ICLR.

[12]  Ondrej Bojar,et al.  Keyphrase Generation: A Multi-Aspect Survey , 2019, 2019 25th Conference of Open Innovations Association (FRUCT).

[13]  Sushil Krishna Bajracharya,et al.  Analyzing and mining a code search engine usage log , 2010, Empirical Software Engineering.

[14]  Rainer Koschke,et al.  Incremental Clone Detection , 2009, 2009 13th European Conference on Software Maintenance and Reengineering.

[15]  Xiaochen Li,et al.  Query Expansion Based on Crowd Knowledge for Code Search , 2016, IEEE Transactions on Services Computing.

[16]  Beijun Shen,et al.  Lancer: Your Code Tell Me What You Need , 2019, 2019 34th IEEE/ACM International Conference on Automated Software Engineering (ASE).

[17]  Mark van den Brand,et al.  DeepClone: Modeling Clones to Generate Code Predictions , 2020, ICSR.

[18]  Chanchal Kumar Roy,et al.  Comparison and evaluation of code clone detection techniques and tools: A qualitative approach , 2009, Sci. Comput. Program..

[19]  Benoit Baudry,et al.  Tailored source code transformations to synthesize computationally diverse program variants , 2014, ISSTA 2014.

[20]  Rainer Koschke,et al.  A systematic mapping study of clone visualization , 2020, Comput. Sci. Rev..

[21]  Emily Hill,et al.  Improving source code search with natural language phrasal representations of method signatures , 2011, 2011 26th IEEE/ACM International Conference on Automated Software Engineering (ASE 2011).

[22]  Gabriele Bavota,et al.  Automatic query reformulations for text retrieval in software engineering , 2013, 2013 35th International Conference on Software Engineering (ICSE).

[23]  Jacques Klein,et al.  FaCoY – A Code-to-Code Search Engine , 2018, 2018 IEEE/ACM 40th International Conference on Software Engineering (ICSE).

[24]  Chanchal Kumar Roy,et al.  The NiCad Clone Detector , 2011, 2011 IEEE 19th International Conference on Program Comprehension.

[25]  Seung-won Hwang,et al.  Towards an Intelligent Code Search Engine , 2010, AAAI.

[26]  Dawn Archer,et al.  Word frequency and key word statistics in corpus linguistics , 2009 .

[27]  Kathryn T. Stolee,et al.  How developers search for code: a case study , 2015, ESEC/SIGSOFT FSE.

[28]  David Lo,et al.  Query expansion via WordNet for effective code search , 2015, 2015 IEEE 22nd International Conference on Software Analysis, Evolution, and Reengineering (SANER).

[29]  Themistoklis G. Diamantopoulos,et al.  Employing Source Code Information to Improve Question-Answering in Stack Overflow , 2015, 2015 IEEE/ACM 12th Working Conference on Mining Software Repositories.

[30]  Chengzhi Zhang,et al.  Automatic Keyword Extraction from Documents Using Conditional Random Fields , 2008 .

[31]  Jeffrey Pennington,et al.  GloVe: Global Vectors for Word Representation , 2014, EMNLP.

[32]  Martin P. Robillard,et al.  What Makes APIs Hard to Learn? Answers from Developers , 2009, IEEE Software.

[33]  Zhenchang Xing,et al.  What do developers search for on the web? , 2017, Empirical Software Engineering.

[34]  Rastislav Bodík,et al.  Jungloid mining: helping to navigate the API jungle , 2005, PLDI '05.

[35]  Collin McMillan,et al.  Portfolio: finding relevant functions and their usage , 2011, 2011 33rd International Conference on Software Engineering (ICSE).

[36]  Anette Hulth,et al.  Improved Automatic Keyword Extraction Given More Linguistic Knowledge , 2003, EMNLP.

[37]  Jian Pei,et al.  MAPO: mining API usages from open source repositories , 2006, MSR '06.

[38]  Emily Hill,et al.  Identifying Word Relations in Software: A Comparative Study of Semantic Similarity Tools , 2008, 2008 16th IEEE International Conference on Program Comprehension.

[39]  Marc Brockschmidt,et al.  CodeSearchNet Challenge: Evaluating the State of Semantic Code Search , 2019, ArXiv.

[40]  James Fogarty,et al.  Assieme: finding and leveraging implicit references in a web search interface for programmers , 2007, UIST '07.

[41]  Jian Pei,et al.  MAPO: Mining and Recommending API Usage Patterns , 2009, ECOOP.

[42]  Gabriele Bavota,et al.  Mining StackOverflow to turn the IDE into a self-confident programming prompter , 2014, MSR 2014.

[43]  Zhendong Su,et al.  DECKARD: Scalable and Accurate Tree-Based Detection of Code Clones , 2007, 29th International Conference on Software Engineering (ICSE'07).

[44]  Jinqiu Yang,et al.  Inferring semantically related words from software context , 2012, 2012 9th IEEE Working Conference on Mining Software Repositories (MSR).

[45]  Mark van den Brand,et al.  Metamodel Clone Detection with SAMOS , 2019, BENEVOL.

[46]  Syed Waqar Jaffry,et al.  Textual keyword extraction and summarization: State-of-the-art , 2019, Inf. Process. Manag..

[47]  Chanchal Kumar Roy,et al.  Towards a Big Data Curated Benchmark of Inter-project Code Clones , 2014, 2014 IEEE International Conference on Software Maintenance and Evolution.

[48]  Yuriy Brun,et al.  Repairing Programs with Semantic Code Search (T) , 2015, 2015 30th IEEE/ACM International Conference on Automated Software Engineering (ASE).

[49]  Mark Harman,et al.  Automated software transplantation , 2015, ISSTA.

[50]  Frank Maurer,et al.  What makes a good code example?: A study of programming Q&A in StackOverflow , 2012, 2012 28th IEEE International Conference on Software Maintenance (ICSM).

[51]  Stan Jarzabek,et al.  A Data Mining Approach for Detecting Higher-Level Clones in Software , 2009, IEEE Transactions on Software Engineering.

[52]  Susan T. Dumais,et al.  The vocabulary problem in human-system communication , 1987, CACM.

[53]  Andrew D. Gordon,et al.  Bimodal Modelling of Source Code and Natural Language , 2015, ICML.

[54]  Ying Zou,et al.  Spotting working code examples , 2014, ICSE.

[55]  Amy J. Ko,et al.  Eliciting design requirements for maintenance-oriented IDEs: a detailed study of corrective and perfective maintenance tasks , 2005, Proceedings. 27th International Conference on Software Engineering, 2005. ICSE 2005..

[56]  Xuan Li,et al.  Relationship-aware code search for JavaScript frameworks , 2016, SIGSOFT FSE.

[57]  Howard P. Iker,et al.  An historical note on the use of word-frequency contiguities in content analysis , 1974 .

[58]  Kathryn T. Stolee,et al.  Solving the Search for Source Code , 2014, ACM Trans. Softw. Eng. Methodol..

[59]  Dongmei Zhang,et al.  CodeHow: Effective Code Search Based on API Understanding and Extended Boolean Model (E) , 2015, 2015 30th IEEE/ACM International Conference on Automated Software Engineering (ASE).

[60]  James Pustejovsky,et al.  Natural Language Annotation for Machine Learning - a Guide to Corpus-Building for Applications , 2012 .

[61]  Jens Krinke,et al.  Siamese: scalable and incremental code clone search via multiple code representations , 2019, Empirical Software Engineering.

[62]  Beijun Shen,et al.  Are the Code Snippets What We Are Searching for? A Benchmark and an Empirical Study on Code Search with Natural-Language Queries , 2020, 2020 IEEE 27th International Conference on Software Analysis, Evolution and Reengineering (SANER).

[63]  Bela Gipp,et al.  Research-paper recommender systems: a literature survey , 2015, International Journal on Digital Libraries.

[64]  Michael McGill,et al.  Introduction to Modern Information Retrieval , 1983 .

[65]  Shinji Kusumoto,et al.  CCFinder: A Multilinguistic Token-Based Code Clone Detection System for Large Scale Source Code , 2002, IEEE Trans. Software Eng..

[66]  Martin Dillon,et al.  Introduction to modern information retrieval: G. Salton and M. McGill. McGraw-Hill, New York (1983). xv + 448 pp., $32.95 ISBN 0-07-054484-0 , 1983 .

[67]  Ralph E. Johnson Documenting frameworks using patterns , 1992, OOPSLA.

[68]  Koushik Sen,et al.  Retrieval on source code: a neural code search , 2018, MAPL@PLDI.

[69]  Westley Weimer,et al.  Synthesizing API usage examples , 2012, 2012 34th International Conference on Software Engineering (ICSE).

[70]  Chanchal Kumar Roy,et al.  BigCloneEval: A Clone Detection Tool Evaluation Framework with BigCloneBench , 2016, 2016 IEEE International Conference on Software Maintenance and Evolution (ICSME).