A Semantics-Based Hybrid Approach on Binary Code Similarity Comparison

Binary code similarity comparison is a methodology for identifying similar or identical code fragments in binary programs. It is indispensable in software engineering and security, with many important applications (e.g., plagiarism detection and bug detection). With the spread of smart and IoT (Internet of Things) devices, an increasing number of programs are ported to multiple architectures (e.g., ARM, MIPS), so it has become necessary to detect similar binary code across architectures as well. The main challenge lies in semantics-equivalent code transformations resulting from different compilation settings, code obfuscation, and varied instruction set architectures. Another challenge is the trade-off between comparison accuracy and coverage. Unfortunately, existing methods still rely heavily on semantics-less code features, which are susceptible to such transformations. Additionally, they perform the comparison either purely statically or purely dynamically, and therefore cannot achieve high accuracy and high coverage simultaneously. In this paper, we propose a semantics-based hybrid method to compare binary function similarity. We execute the reference function with test cases, then emulate the execution of every target function with the runtime information migrated from the reference function. Semantic signatures are extracted during both the execution and the emulation. Lastly, similarity scores are calculated from the signatures to measure the likeness of functions. We have implemented the method in a prototype system named BinMatch and evaluated it with nine real-world projects compiled with different compilation settings, on various architectures, and with commonly used obfuscation methods, performing over 100 million function-pair comparisons in total.
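The final step, scoring signatures against each other, can be illustrated with a minimal sketch. The abstract does not specify the signature format or the scoring formula, so everything below is an assumption: signatures are modeled as plain sequences of observed values, and the score is a Jaccard-style ratio computed over the longest common subsequence (LCS). The names `lcs_length`, `signature_similarity`, and `rank_targets` are hypothetical and are not taken from BinMatch.

```python
from typing import Dict, Hashable, List, Sequence, Tuple


def lcs_length(a: Sequence[Hashable], b: Sequence[Hashable]) -> int:
    """Length of the longest common subsequence, via row-by-row dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def signature_similarity(ref_sig: Sequence[Hashable], tgt_sig: Sequence[Hashable]) -> float:
    """Assumed score: |LCS| / (|ref| + |tgt| - |LCS|), a Jaccard-style ratio in [0, 1]."""
    if not ref_sig and not tgt_sig:
        return 1.0  # two empty signatures are treated as identical (an assumption)
    common = lcs_length(ref_sig, tgt_sig)
    return common / (len(ref_sig) + len(tgt_sig) - common)


def rank_targets(
    ref_sig: Sequence[Hashable], targets: Dict[str, Sequence[Hashable]]
) -> List[Tuple[str, float]]:
    """Score every target signature against the reference, most similar first."""
    scores = {name: signature_similarity(ref_sig, sig) for name, sig in targets.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    # Made-up signatures: values recorded while running a reference function
    # and while emulating two hypothetical target functions.
    reference = [1, 2, 3, 4]
    candidates = {"target_a": [1, 2, 3, 4], "target_b": [2, 4, 5]}
    for name, score in rank_targets(reference, candidates):
        print(name, round(score, 2))
```

An order-preserving measure such as LCS is used here because values recorded during execution arrive in sequence; a set-based Jaccard index would discard that ordering. Other choices are equally plausible under the abstract's description.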
