Semantics-based obfuscation-resilient binary code similarity comparison with applications to software plagiarism detection

Existing code similarity comparison methods, whether source-based or binary-based, are mostly not resilient to obfuscation. In the case of software plagiarism, emerging obfuscation techniques have made automated detection increasingly difficult. In this paper, we propose a binary-oriented, obfuscation-resilient method based on a new concept, the longest common subsequence of semantically equivalent basic blocks, which combines rigorous program semantics with fuzzy matching based on the longest common subsequence. We model the semantics of a basic block as a set of symbolic formulas representing the input-output relations of the block. The semantic equivalence (and similarity) of two blocks can then be checked by a theorem prover. We model the semantic similarity of two paths using the longest common subsequence with basic blocks as elements. This novel combination results in strong resilience to code obfuscation. We have developed a prototype, and our experimental results show that the method is effective and practical when applied to real-world software.
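The path-level matching described above can be illustrated with a minimal sketch. It is not the prototype's implementation: basic blocks are modeled here as plain Python functions of two integer inputs, and the theorem-prover equivalence check is replaced by exhaustive comparison over a small input domain, which is an assumption made only to keep the example self-contained. The names `blocks_equivalent`, `lcs_length`, and `path_similarity`, and the choice to normalize by the length of the reference path, are likewise hypothetical.

```python
from itertools import product


def blocks_equivalent(f, g, n_inputs=2, domain=range(-4, 5)):
    """Stand-in for the theorem-prover check (assumption): two blocks are
    treated as equivalent if they agree on every input in a small domain."""
    return all(f(*args) == g(*args) for args in product(domain, repeat=n_inputs))


def lcs_length(path_a, path_b, equiv):
    """Longest common subsequence of two block sequences, where two elements
    match when equiv(a, b) holds instead of when they are syntactically equal."""
    rows, cols = len(path_a), len(path_b)
    table = [[0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            if equiv(path_a[i - 1], path_b[j - 1]):
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[rows][cols]


def path_similarity(path_a, path_b, equiv=blocks_equivalent):
    """Fraction of the reference path (path_a) that is matched in path_b.
    Normalizing by len(path_a) is an illustrative choice."""
    if not path_a:
        return 0.0
    return lcs_length(path_a, path_b, equiv) / len(path_a)


# Reference path: three blocks, each mapping inputs (a, b) to one output.
original = [
    lambda a, b: a + b,
    lambda a, b: a * 2,
    lambda a, b: a - b,
]

# Obfuscated path: rewritten blocks plus one inserted junk block.
obfuscated = [
    lambda a, b: a - (-b),   # equivalent to a + b
    lambda a, b: b,          # junk block, matches nothing in the original
    lambda a, b: a + a,      # equivalent to a * 2
    lambda a, b: a + (-b),   # equivalent to a - b
]

# Unrelated path: a single block that matches none of the original blocks.
unrelated = [lambda a, b: a * b]

print(path_similarity(original, obfuscated))
print(path_similarity(original, unrelated))
```

Because matching uses semantic equivalence rather than syntactic identity, the rewritten blocks still align with the originals, and the subsequence formulation skips the inserted junk block.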