Binary Code Clone Detection across Architectures and Compiling Configurations

Binary code clone (or similarity) detection is a fundamental technique for many important applications, such as plagiarism detection, malware analysis, software vulnerability assessment and program comprehension. With the prevalence of smart and IoT (Internet of Things) devices, more and more programs are being ported from traditional desktop platforms (e.g., IA-32) to ARM and MIPS architectures, so it is imperative to detect cloned binary code across architectures. However, because the instruction sets of different architectures are not directly comparable and binaries may be built with different compiling configurations, it is difficult to conduct binary code clone detection with traditional syntax- or structure-based methods. To address this problem, we propose a semantics-based approach. We recognize the arguments and indirect jump targets of each binary function, and emulate the execution of those functions to extract semantic signatures that are used to measure the similarity of functions. The approach has been implemented in a prototype system named CACompare, which detects cloned binary functions across architectures and compiling configurations. It supports comparisons between mainstream architectures (IA-32, ARM and MIPS) and is able to analyze binaries on the Linux platform. The experimental results show that CACompare is not only effective in dealing with binaries of different architectures and varying compiling configurations, but also improves the accuracy of binary code clone detection compared to state-of-the-art solutions.
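
The idea of comparing functions by signatures gathered from emulated execution can be illustrated with a minimal Python sketch. It is not CACompare's implementation: the abstract does not say what a semantic signature contains or which similarity measure is used, so the sketch assumes a signature is the set of observed (input, output) pairs and that similarity is the Jaccard index. Plain Python callables stand in for emulated binary functions, and the names `extract_signature` and `jaccard` are hypothetical.

```python
from typing import Callable, FrozenSet, Iterable, Tuple

# Assumption: a "semantic signature" is the set of (arguments, result) pairs
# observed while running a function on a fixed list of inputs.
Signature = FrozenSet[Tuple[Tuple[int, ...], int]]


def extract_signature(func: Callable[..., int],
                      inputs: Iterable[Tuple[int, ...]]) -> Signature:
    """Run func on every input tuple and record what was observed."""
    return frozenset((args, func(*args)) for args in inputs)


def jaccard(a: Signature, b: Signature) -> float:
    """Jaccard index of two signatures (assumed similarity measure)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Stand-ins for one source function compiled in two different ways,
# plus an unrelated function.
variant_a = lambda x, y: x * 2 + y
variant_b = lambda x, y: (x << 1) + y
unrelated = lambda x, y: x - y

inputs = [(0, 0), (1, 2), (3, 5), (7, 1)]
sig_a = extract_signature(variant_a, inputs)
sig_b = extract_signature(variant_b, inputs)
sig_c = extract_signature(unrelated, inputs)

print(jaccard(sig_a, sig_b))  # the two variants behave identically
print(jaccard(sig_a, sig_c))  # the unrelated function mostly differs
```

The two variants differ syntactically but yield the same signature, which is the property that lets a semantics-based comparison survive changes of instruction set or compiler options where syntax-based matching would fail.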
