Detecting source code similarity using code abstraction

Various approaches have been proposed to develop effective methods to measure program similarity. Even commercial tools and freeware tools are available for measuring program similarity based on source code comparison. These tools are quite useful to handle small to middle scale software products, but limited for large scale software products. In addition, these tools may report similarity measures with less credentials for the source code either obfuscated by malicious users or generated by automatic program template generation tools. To handle large scale software, more drastic measures should be provided. In this paper, we propose an automatic abstraction method to summarize source code. We eliminate a large portion of source code which is less relevant to similarity comparison. With this abstraction, our similarity comparison method can provide more robust measures for obfuscation and automatic code generation. We evaluate our abstraction method by running through source comparison tool --- MOSS, a web-based similarity detection tool. According to our experiment with multiple versions of Apache HTTP server, Putty SSH client, and Lighttpd server, our abstraction method reports quite reliable results with abstracted source code, which are only 23--35% of original source code. As the execution time for pattern match is linearly proportional to the length of the source code, our method can reduce the execution time as much as the percentage of source code reduction.

[1]  Oscar Nierstrasz,et al.  On the effectiveness of clone detection by string matching , 2006, J. Softw. Maintenance Res. Pract..

[2]  Shinji Kusumoto,et al.  CCFinder: A Multilinguistic Token-Based Code Clone Detection System for Large Scale Source Code , 2002, IEEE Trans. Software Eng..

[3]  Nicholas Tran,et al.  Sim: a utility for detecting similarity in computer programs , 1999, SIGCSE '99.

[4]  Rachel Harrison,et al.  Evolution in software systems: foundations of the SPE classification scheme: Research Articles , 2006 .

[5]  Mohammad El-Ramly,et al.  Similarity in Programs , 2006, Duplication, Redundancy, and Similarity in Software.

[6]  Yuanyuan Zhou,et al.  CP-Miner: finding copy-paste and related bugs in large-scale software code , 2006, IEEE Transactions on Software Engineering.

[7]  Alfred V. Aho,et al.  Compilers: Principles, Techniques, and Tools , 1986, Addison-Wesley series in computer science / World student series edition.

[8]  Rudolf Eigenmann,et al.  Experiences in Using Cetus for Source-to-Source Transformations , 2004, LCPC.

[9]  Ronald L. Rivest,et al.  Introduction to Algorithms, third edition , 2009 .

[10]  Christopher W. Fraser,et al.  Clone detection via structural abstraction , 2007, 14th Working Conference on Reverse Engineering (WCRE 2007).

[11]  Joe D. Warren,et al.  The program dependence graph and its use in optimization , 1987, TOPL.

[12]  Philip S. Yu,et al.  GPLAG: detection of software plagiarism by program dependence graph analysis , 2006, KDD '06.

[13]  Pavel Berkhin,et al.  A Survey of Clustering Data Mining Techniques , 2006, Grouping Multidimensional Data.

[14]  Zhendong Su,et al.  DECKARD: Scalable and Accurate Tree-Based Detection of Code Clones , 2007, 29th International Conference on Software Engineering (ICSE'07).

[15]  Brenda S. Baker,et al.  On finding duplication and near-duplication in large software systems , 1995, Proceedings of 2nd Working Conference on Reverse Engineering.

[16]  Massimiliano Di Penta,et al.  An approach to identify duplicated web pages , 2002, Proceedings 26th Annual International Computer Software and Applications.

[17]  Zhendong Su,et al.  Scalable detection of semantic clones , 2008, 2008 ACM/IEEE 30th International Conference on Software Engineering.

[18]  Rudolf Eigenmann,et al.  Cetus: A Source-to-Source Compiler Infrastructure for Multicores , 2009, Computer.

[19]  Xin-She Yang,et al.  Introduction to Algorithms , 2021, Nature-Inspired Optimization Algorithms.

[20]  Daniel Shawcross Wilkerson,et al.  Winnowing: local algorithms for document fingerprinting , 2003, SIGMOD '03.

[21]  Jürgen Wolff von Gudenberg,et al.  Clone detection in source code by frequent itemset techniques , 2004, Source Code Analysis and Manipulation, Fourth IEEE International Workshop on.

[22]  Renato De Mori,et al.  Pattern matching for clone and concept detection , 2004, Automated Software Engineering.

[23]  Filippo Lanubile,et al.  Finding function clones in Web applications , 2003, Seventh European Conference onSoftware Maintenance and Reengineering, 2003. Proceedings..

[24]  Edv Informatik Apache HTTP Server , 2010 .

[25]  Terence Parr,et al.  LL(*): the foundation of the ANTLR parser generator , 2011, PLDI '11.