BinMatch: A Semantics-Based Hybrid Approach on Binary Code Clone Analysis

Binary code clone analysis is an important technique which has a wide range of applications in software engineering (e.g., plagiarism detection, bug detection). The main challenge of the topic lies in the semantics-equivalent code transformation (e.g., optimization, obfuscation) which would alter representations of binary code tremendously. Another challenge is the trade-off between detection accuracy and coverage. Unfortunately, existing techniques still rely on semantics-less code features which are susceptible to the code transformation. Besides, they adopt merely either a static or a dynamic approach to detect binary code clones, which cannot achieve high accuracy and coverage simultaneously.  In this paper, we propose a semantics-based hybrid approach to detect binary clone functions. We execute a template binary function with its test cases, and emulate the execution of every target function for clone comparison with the runtime information migrated from that template function. The semantic signatures are extracted during the execution of the template function and emulation of the target function. Lastly, a similarity score is calculated from their signatures to measure their likeness. We implement the approach in a prototype system designated as BinMatch which analyzes IA-32 binary code on the Linux platform. We evaluate BinMatch with eight real-world projects compiled with different compilation configurations and commonly-used obfuscation methods, totally performing over 100 million pairs of function comparison. The experimental results show that BinMatch is robust to the semantics-equivalent code transformation. Besides, it not only covers all target functions for clone analysis, but also improves the detection accuracy comparing to the state-of-the-art solutions.

[1]  Pascal Junod,et al.  Obfuscator-LLVM -- Software Protection for the Masses , 2015, 2015 IEEE/ACM 1st International Workshop on Software Protection.

[2]  Zhendong Su,et al.  DECKARD: Scalable and Accurate Tree-Based Detection of Code Clones , 2007, 29th International Conference on Software Engineering (ICSE'07).

[3]  Christopher Krügel,et al.  SOK: (State of) The Art of War: Offensive Techniques in Binary Analysis , 2016, 2016 IEEE Symposium on Security and Privacy (SP).

[4]  Shinji Kusumoto,et al.  CCFinder: A Multilinguistic Token-Based Code Clone Detection System for Large Scale Source Code , 2002, IEEE Trans. Software Eng..

[5]  Yaniv David,et al.  Tracelet-based code search in executables , 2014, PLDI.

[6]  Xi Chen,et al.  An In-Depth Analysis of Disassembly on Full-Scale x86/x64 Binaries , 2016, USENIX Security Symposium.

[7]  Daniel S. Hirschberg,et al.  A linear space algorithm for computing maximal common subsequences , 1975, Commun. ACM.

[8]  Sencun Zhu,et al.  Value-based program characterization and its application to software plagiarism detection , 2011, 2011 33rd International Conference on Software Engineering (ICSE).

[9]  Juanru Li,et al.  Cross-Architecture Binary Semantics Understanding via Similar Code Comparison , 2016, 2016 IEEE 23rd International Conference on Software Analysis, Evolution, and Reengineering (SANER).

[10]  David Brumley,et al.  Blanket Execution: Dynamic Similarity Testing for Program Binaries and Components , 2014, USENIX Security Symposium.

[11]  Ping Li,et al.  In Defense of Minhash over Simhash , 2014, AISTATS.

[12]  Benjamin C. M. Fung,et al.  Kam1n0: MapReduce-based Assembly Clone Search for Reverse Engineering , 2016, KDD.

[13]  Khaled Yakdan,et al.  discovRE: Efficient Cross-Architecture Identification of Bugs in Binary Code , 2016, NDSS.

[14]  Eran Yahav,et al.  Statistical similarity of binaries , 2016, PLDI.

[15]  Jiang Ming,et al.  BinSim: Trace-based Semantic Binary Diffing via System Call Sliced Segment Equivalence Checking , 2017, USENIX Security Symposium.

[16]  Heng Yin,et al.  Scalable Graph-based Bug Search for Firmware Images , 2016, CCS.

[17]  Christian Rossow,et al.  Cross-Architecture Bug Search in Binary Executables , 2015, 2015 IEEE Symposium on Security and Privacy.

[18]  T. Laszlo,et al.  OBFUSCATING C++ PROGRAMS VIA CONTROL FLOW FLATTENING , 2009 .

[19]  Andrew Walenstein,et al.  The Software Similarity Problem in Malware Analysis , 2006, Duplication, Redundancy, and Similarity in Software.

[20]  Minkyu Jung,et al.  Testing intermediate representations for binary analysis , 2017, 2017 32nd IEEE/ACM International Conference on Automated Software Engineering (ASE).

[21]  Saumya Debray,et al.  A Generic Approach to Automatic Deobfuscation of Executable Code , 2015, 2015 IEEE Symposium on Security and Privacy.

[22]  L. Bergroth,et al.  A survey of longest common subsequence algorithms , 2000, Proceedings Seventh International Symposium on String Processing and Information Retrieval. SPIRE 2000.

[23]  Daniel J. Quinlan,et al.  Detecting code clones in binary executables , 2009, ISSTA.

[24]  Saumya K. Debray,et al.  Deobfuscation: reverse engineering obfuscated code , 2005, 12th Working Conference on Reverse Engineering (WCRE'05).

[25]  Nicholas Nethercote,et al.  Valgrind: a framework for heavyweight dynamic binary instrumentation , 2007, PLDI '07.

[26]  Lu Zhang,et al.  Can I clone this piece of code here? , 2012, 2012 Proceedings of the 27th IEEE/ACM International Conference on Automated Software Engineering.

[27]  Halvar Flake,et al.  Structural Comparison of Executable Objects , 2004, DIMVA.

[28]  Yang Liu,et al.  BinGo: cross-architecture cross-OS binary search , 2016, SIGSOFT FSE.

[29]  David Brumley,et al.  Automatic Patch-Based Exploit Generation is Possible: Techniques and Implications , 2008, 2008 IEEE Symposium on Security and Privacy (sp 2008).

[30]  Saumya Debray,et al.  Symbolic Execution of Obfuscated Code , 2015, CCS.

[31]  Karl Trygve Kalleberg,et al.  Finding software license violations through binary code clone detection , 2011, MSR '11.

[32]  Fangfang Zhang,et al.  Program Logic Based Software Plagiarism Detection , 2014, 2014 IEEE 25th International Symposium on Software Reliability Engineering.

[33]  Sencun Zhu,et al.  Semantics-based obfuscation-resilient binary code similarity comparison with applications to software plagiarism detection , 2014, SIGSOFT FSE.

[34]  Ross J. Anderson,et al.  Rendezvous: A search engine for binary code , 2013, 2013 10th Working Conference on Mining Software Repositories (MSR).

[35]  Ronald Rousseau,et al.  Similarity measures in scientometric research: The Jaccard index versus Salton's cosine formula , 1989, Inf. Process. Manag..

[36]  Juanru Li,et al.  Binary Code Clone Detection across Architectures and Compiling Configurations , 2017, 2017 IEEE/ACM 25th International Conference on Program Comprehension (ICPC).

[37]  Nahid Shahmehri,et al.  Turning programs against each other: high coverage fuzz-testing using binary-code mutation and dynamic slicing , 2015, ESEC/SIGSOFT FSE.

[38]  Dinghao Wu,et al.  In-memory fuzzing for binary code similarity analysis , 2017, 2017 32nd IEEE/ACM International Conference on Automated Software Engineering (ASE).

[39]  Sencun Zhu,et al.  Behavior based software theft detection , 2009, CCS.

[40]  Fangfang Zhang,et al.  A first step towards algorithm plagiarism detection , 2012, ISSTA 2012.

[41]  Yuanyuan Zhou,et al.  CP-Miner: finding copy-paste and related bugs in large-scale software code , 2006, IEEE Transactions on Software Engineering.

[42]  Stefano Zanero,et al.  Lines of malicious code: insights into the malicious software industry , 2012, ACSAC '12.