Value-based program characterization and its application to software plagiarism detection

Identifying similar or identical code fragments becomes much more challenging in code theft cases, where plagiarists can use automated code transformation techniques to hide stolen code from detection. Previous work in this field is limited in three ways: (1) most approaches cannot handle advanced obfuscation techniques; (2) methods based on source code analysis are less practical, since the source code of a suspicious program is typically not available until strong evidence has been collected; and (3) methods that depend on features of specific operating systems or programming languages have limited applicability. Based on the observation that some critical runtime values are hard to replace or eliminate with semantics-preserving transformation techniques, we introduce a novel approach to the dynamic characterization of executable programs. By leveraging such invariant values, our technique is resilient to various control and data obfuscation techniques. We show how runtime values can be extracted and refined to expose the critical values, and how this runtime property can be applied to software plagiarism detection. We have implemented a prototype with a dynamic taint analyzer atop a generic processor emulator. Our experimental results show that the value-based method successfully discriminates 34 plagiarisms obfuscated by SandMark, plagiarisms heavily obfuscated by KlassMaster, programs obfuscated by Thicket, and executables obfuscated by Loco/Diablo.
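
The core idea of comparing programs by the runtime values they compute can be illustrated with a minimal sketch. This is not the prototype described above: the value sequences are made-up data, the function names are hypothetical, and using the longest common subsequence (LCS) as the similarity measure is an assumption chosen for illustration. The sketch assumes that each program has already been reduced to an ordered sequence of refined runtime values.

```python
def lcs_length(a, b):
    """Return the length of the longest common subsequence of a and b."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def value_similarity(original_values, suspect_values):
    """Fraction of the original value sequence preserved, in order, in the suspect.

    Hypothetical scoring function: 1.0 means every value of the original
    appears in the suspect sequence in the same relative order.
    """
    if not original_values:
        return 0.0
    return lcs_length(original_values, suspect_values) / len(original_values)


# Made-up value sequences, for illustration only.
original = [0x1F, 0x3A7, 0x10, 0x5B2C, 0xFF]
# An obfuscated copy may interleave extra values but keeps the critical ones.
obfuscated = [0x0, 0x1F, 0x77, 0x3A7, 0x10, 0x9, 0x5B2C, 0xFF]
# An independently written program shares few values with the original.
unrelated = [0x2, 0x44, 0x10, 0x800]

print(value_similarity(original, obfuscated))
print(value_similarity(original, unrelated))
```

Under these assumptions, values inserted by an obfuscator do not lower the score, because the measure only asks whether the original values survive in order; a threshold on the score would then separate suspected copies from independent programs.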
