Using web corpus statistics for program analysis

Several program analysis tools - such as plagiarism detection and bug finding - rely on knowing a piece of code's relative semantic importance. For example, a plagiarism detector should not bother reporting two programs that have an identical simple loop counter test, but should report programs that share more distinctive code. Traditional program analysis techniques (e.g., finding data and control dependencies) are useful, but do not say how surprising or common a line of code is. Natural language processing researchers have encountered a similar problem and addressed it using an n-gram model of text frequency, derived from statistics computed over text corpora. We propose and compute an n-gram model for programming languages, computed over a corpus of 2.8 million JavaScript programs we downloaded from the Web. In contrast to previous techniques, we describe a code n-gram as a subgraph of the program dependence graph that contains all nodes and edges reachable in n steps from the statement. We can count n-grams in a program and count the frequency of n-grams in the corpus, enabling us to compute tf-idf-style measures that capture the differing importance of different lines of code. We demonstrate the power of this approach by implementing a plagiarism detector with accuracy that beats previous techniques, and a bug-finding tool that discovered over a dozen previously unknown bugs in a collection of real deployed programs.

[1]  Joe D. Warren,et al.  The program dependence graph and its use in optimization , 1987, TOPL.

[2]  W. Richards Adrion Proceedings of the 19th international conference on Software engineering , 1997, ICSE 1997.

[3]  Steven S. Muchnick,et al.  Advanced Compiler Design and Implementation , 1997 .

[4]  Susan Horwitz,et al.  Using Slicing to Identify Duplication in Source Code , 2001, SAS.

[5]  Michael Philippsen,et al.  Finding Plagiarisms among a Set of Programs with JPlag , 2002, J. Univers. Comput. Sci..

[6]  Daniel Shawcross Wilkerson,et al.  Winnowing: local algorithms for document fingerprinting , 2003, SIGMOD '03.

[7]  Philip S. Yu,et al.  Graph indexing: a frequent structure-based approach , 2004, SIGMOD '04.

[8]  Karen Spärck Jones A statistical interpretation of term specificity and its application in retrieval , 2021, J. Documentation.

[9]  Michael I. Jordan,et al.  Scalable statistical bug isolation , 2005, PLDI '05.

[10]  Kamalakar Karlapalem,et al.  MARGIN: Maximal Frequent Subgraph Mining , 2006, ICDM.

[11]  Yuanyuan Zhou,et al.  CP-Miner: finding copy-paste and related bugs in large-scale software code , 2006, IEEE Transactions on Software Engineering.

[12]  Philip S. Yu,et al.  GPLAG: detection of software plagiarism by program dependence graph analysis , 2006, KDD '06.

[13]  Alfred V. Aho,et al.  Compilers: Principles, Techniques, and Tools (2nd Edition) , 2006 .

[14]  Zhendong Su,et al.  DECKARD: Scalable and Accurate Tree-Based Detection of Code Clones , 2007, 29th International Conference on Software Engineering (ICSE'07).

[15]  Christopher W. Fraser,et al.  Clone detection via structural abstraction , 2007, 14th Working Conference on Reverse Engineering (WCRE 2007).

[16]  Wei Wang,et al.  Graph Database Indexing Using Structured Graph Decomposition , 2007, 2007 IEEE 23rd International Conference on Data Engineering.

[17]  Thorsten Brants,et al.  Large Language Models in Machine Translation , 2007, EMNLP.

[18]  Lei Zou,et al.  A novel spectral coding in a large graph database , 2008, EDBT '08.

[19]  Zhendong Su,et al.  Scalable detection of semantic clones , 2008, 2008 ACM/IEEE 30th International Conference on Software Engineering.

[20]  Elmar Jürgens,et al.  Do code clones matter? , 2009, 2009 IEEE 31st International Conference on Software Engineering.

[21]  Lei Zou,et al.  DistanceJoin: Pattern Match Query In a Large Graph Database , 2009, Proc. VLDB Endow..

[22]  Diana Inkpen,et al.  Real-word spelling correction using Google web 1Tn-gram data set , 2009, CIKM.

[23]  Christopher D. Manning,et al.  Introduction to Information Retrieval , 2010, J. Assoc. Inf. Sci. Technol..

[24]  M. Ravallion The Two Poverty Enlightenments: Historical Insights from Digitized Books Spanning Three Centuries , 2011 .

[25]  Benjamin P. Wood,et al.  Isolating and understanding concurrency errors using reconstructed execution fragments , 2011, PLDI '11.

[26]  Erez Lieberman Aiden,et al.  Quantitative Analysis of Culture Using Millions of Digitized Books , 2010, Science.

[27]  Miryung Kim,et al.  An Empirical Study of Long-Lived Code Clones , 2011, FASE.

[28]  A. Rainer,et al.  Same Difference: Detecting Collusion by Finding Unusual Shared Elements , 2012 .

[29]  Premkumar T. Devanbu,et al.  On the naturalness of software , 2016, Commun. ACM.

[30]  Ross J. Anderson,et al.  Rendezvous: A search engine for binary code , 2013, 2013 10th Working Conference on Mining Software Repositories (MSR).

[31]  Anh Tuan Nguyen,et al.  A statistical semantic language model for source code , 2013, ESEC/FSE 2013.

[32]  Eran Yahav,et al.  Code completion with statistical language models , 2014, PLDI.