Source Forager: A Search Engine for Similar Source Code

Developers spend a significant amount of time searching for code: e.g., to understand how to complete, correct, or adapt their own code for a new context. Unfortunately, the state of the art in code search has not evolved much beyond text search over tokenized source. Code has much richer structure and semantics than normal text, and this property can be exploited to specialize the code-search process for better querying, searching, and ranking of code-search results. We present a new code-search engine named Source Forager. Given a query in the form of a C/C++ function, Source Forager searches a pre-populated code database for similar C/C++ functions. Source Forager preprocesses the database to extract a variety of simple code features that capture different aspects of code. A search returns the $k$ functions in the database that are most similar to the query, based on the various extracted code features. We tested the usefulness of Source Forager using a variety of code-search queries from two domains. Our experiments show that the ranked results returned by Source Forager are accurate, and that query-relevant functions can be reliably retrieved even when searching through a large code database that contains very few query-relevant functions. We believe that Source Forager is a first step towards much-needed tools that provide a better code-search experience.
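The top-$k$ retrieval described above can be illustrated with a minimal sketch. This is not Source Forager's implementation. The feature names (`callees`, `identifiers`), the use of Jaccard similarity, the equal weights, and the toy database are all assumptions made for illustration. The sketch only shows the general shape of the idea: each function is represented by several simple features, per-feature similarities are combined into one score, and the $k$ best-scoring functions are returned.

```python
import heapq

def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def combined_similarity(query, candidate, weights):
    """Weighted average of per-feature Jaccard similarities (an assumed scheme)."""
    total = sum(weights.values())
    score = sum(
        w * jaccard(query.get(name, set()), candidate.get(name, set()))
        for name, w in weights.items()
    )
    return score / total

def top_k(query, database, weights, k):
    """Return the names of the k database functions most similar to the query."""
    scored = (
        (combined_similarity(query, feats, weights), name)
        for name, feats in database.items()
    )
    return [name for _, name in heapq.nlargest(k, scored)]

# Hypothetical pre-extracted features for three functions.
database = {
    "binary_search": {"callees": set(), "identifiers": {"lo", "hi", "mid", "key"}},
    "linear_search": {"callees": set(), "identifiers": {"i", "n", "key"}},
    "copy_buffer": {"callees": {"memcpy"}, "identifiers": {"dst", "src", "n"}},
}
# Hypothetical features extracted from the query function.
query = {"callees": set(), "identifiers": {"low", "high", "mid", "key"}}
weights = {"callees": 1.0, "identifiers": 1.0}

print(top_k(query, database, weights, 2))  # ['binary_search', 'linear_search']
```

In this sketch, changing the entries of `weights` changes how much each feature contributes to the ranking, which is one simple way to combine several code features into a single score.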
