Semantic code search via equational reasoning

We present a new approach to semantic code search based on equational reasoning, and the Yogo tool, which implements it. The approach considers not only the dataflow graph of a function, but also the dataflow graphs of all equivalent functions reachable via a set of rewrite rules. As a result, it can recognize an operation even if the operation uses alternate APIs, appears in a different but mathematically equivalent form, is split apart with temporary variables, or is interleaved with other code. It can also recognize when code is an instance of a higher-level concept, such as iterating through a file. Because of this, a single Yogo query can find equivalent code in multiple languages. Our evaluation further shows the utility of Yogo beyond code search: by encoding a buggy pattern as a Yogo query, we found a bug in Oracle’s Graal compiler that had been missed by a hand-written static analyzer designed for that exact kind of bug. Yogo is built on the Cubix multi-language infrastructure and currently supports Java and Python.
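The idea of matching a query against every form reachable through rewrite rules can be illustrated with a small Python sketch. This is not Yogo's implementation: the abstract does not describe its data structures, and the term encoding, the three arithmetic rules, and the function names below are all assumptions chosen for illustration. The sketch naively enumerates equivalent expression trees up to a size bound, whereas a practical tool would need a far more compact representation.

```python
# Terms are nested tuples ("op", arg1, arg2); leaves are variable names or ints.
# All rules below are hypothetical examples, not rules taken from Yogo.

def rewrites_at_root(term):
    """Yield terms equal to `term` by applying one rule at the root."""
    if not isinstance(term, tuple) or len(term) != 3:
        return
    op, left, right = term
    if op in ("+", "*"):
        yield (op, right, left)        # commutativity: a op b = b op a
    if op == "*" and right == 2:
        yield ("+", left, left)        # x * 2 = x + x
    if op == "+" and left == right:
        yield ("*", left, 2)           # x + x = x * 2
    if op == "<<" and right == 1:
        yield ("*", left, 2)           # x << 1 = x * 2


def one_step(term):
    """Yield every term reachable by one rewrite anywhere inside `term`."""
    yield from rewrites_at_root(term)
    if isinstance(term, tuple):
        for i in range(1, len(term)):
            for new_child in one_step(term[i]):
                yield term[:i] + (new_child,) + term[i + 1:]


def equivalent_forms(term, max_terms=1000):
    """Collect terms reachable from `term`, stopping at a size bound."""
    seen = {term}
    frontier = [term]
    while frontier and len(seen) < max_terms:
        current = frontier.pop()
        for nxt in one_step(current):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return seen


def matches(query, code):
    """True if `code` can be rewritten into exactly the query term."""
    return query in equivalent_forms(code)


if __name__ == "__main__":
    query = ("*", "n", 2)  # "double n"
    candidates = [("+", "n", "n"), ("<<", "n", 1), ("*", 2, "n"), ("-", "n", 2)]
    for candidate in candidates:
        print(candidate, matches(query, candidate))
```

Under these assumed rules, the single query `("*", "n", 2)` matches the first three candidates, which differ in form but are equal, and rejects the subtraction. The `max_terms` bound is there because richer rule sets can generate unboundedly many forms.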
