Binary Code Reuse Detection for Reverse Engineering and Malware Analysis

Code reuse detection is a key technique in reverse engineering. However, existing source code similarity comparison techniques are not applicable to binary code. Moreover, compilers have made this problem even more difficult due to the fact that different assembly code and control flow structures can be generated by the compilers even when implementing the same functionality. To address this problem, we present a fuzzy matching approach to compare two functions. We first obtain our initial mapping between basic blocks by leveraging the concept of longest common subsequence on the basic block level and execution path level. Then, we extend the achieved mapping using neighborhood exploration. To make our approach applicable to large data sets, we designed an effective filtering process using Minhashing and locality-sensitive hashing. Based on the approach proposed in this thesis, we implemented a tool named BinSequence. We conducted extensive experiments to test BinSequence in terms of performance, accuracy, and scalability. Our results suggest that, given a large assembly code repository with millions of functions, BinSequence is efficient and can attain high quality similarity ranking of assembly functions with an accuracy above 90% within seconds. We also present several practical use cases including patch analysis, malware analysis, and bug search. In the use case for patch analysis, we utilized BinSequence to compare the unpatched and patched versions of the same binary, to reveal the vulnerability information and the details of the patch. For this use case, a Windows system driver (HTTP.sys) which contains a recently published critical vulnerability is used. For the malware analysis use case, we utilized BinSequence to identify reused components or already analyzed parts in malware so that the human analyst can focus on those new functionality to save time and effort. In this use case, two infamous malware, Zeus and Citadel, are analyzed. Finally, in the bug search use case, we utilized BinSequence to identify vulnerable functions in software caused by copying and pasting or sharing buggy source code. In this case, we succeeded in using BinSequence to identify a bug from Firefox. Together, these use cases demonstrate that our tool is both efficient and effective when applied to real-world scenarios. We also compared BinSequence with three state of the art tools: Diaphora, PatchDiff2 and BinDiff. Experiment results show that BinSequence can achieve the best accuracy when compared with these tools.

[1]  David W. Binkley,et al.  Program slicing , 2008, 2008 Frontiers of Software Maintenance.

[2]  Christopher Krügel,et al.  Identifying Dormant Functionality in Malware Programs , 2010, 2010 IEEE Symposium on Security and Privacy.

[3]  Christian S. Collberg,et al.  K-gram based software birthmarks , 2005, SAC '05.

[4]  Shinji Kusumoto,et al.  CCFinder: A Multilinguistic Token-Based Code Clone Detection System for Large Scale Source Code , 2002, IEEE Trans. Software Eng..

[5]  Debin Gao,et al.  BinHunt: Automatically Finding Semantic Differences in Binary Programs , 2008, ICICS.

[6]  Thomas Dullien,et al.  Automated Attacker Correlation for Malicious Code , 2010 .

[7]  David Brumley,et al.  BitShred: Fast, Scalable Code Reuse Detection in Binary Code (CMU-CyLab-10-006) , 2007 .

[8]  William F. Smyth,et al.  Efficient token based clone detection with flexible tokenization , 2007, ESEC-FSE companion '07.

[9]  Benjamin C. M. Fung,et al.  BinClone: Detecting Code Clones in Malware , 2014, 2014 Eighth International Conference on Software Security and Reliability.

[10]  Sallie M. Henry,et al.  Software Structure Metrics Based on Information Flow , 1981, IEEE Transactions on Software Engineering.

[11]  Enno Ohlebusch,et al.  Replacing suffix trees with enhanced suffix arrays , 2004, J. Discrete Algorithms.

[12]  Wuu Yang,et al.  Identifying syntactic differences between two programs , 1991, Softw. Pract. Exp..

[13]  Chanchal Kumar Roy,et al.  Comparison and evaluation of code clone detection techniques and tools: A qualitative approach , 2009, Sci. Comput. Program..

[14]  Maninder Singh,et al.  Software clone detection: A systematic review , 2013, Inf. Softw. Technol..

[15]  Halvar Flake,et al.  Structural Comparison of Executable Objects , 2004, DIMVA.

[16]  J. Munkres ALGORITHMS FOR THE ASSIGNMENT AND TRANSIORTATION tROBLEMS* , 1957 .

[17]  Julian R. Ullmann,et al.  An Algorithm for Subgraph Isomorphism , 1976, J. ACM.

[18]  Atul Prakash,et al.  Expose: Discovering Potential Binary Code Re-use , 2013, 2013 IEEE 37th Annual Computer Software and Applications Conference.

[19]  Sencun Zhu,et al.  Semantics-based obfuscation-resilient binary code similarity comparison with applications to software plagiarism detection , 2014, SIGSOFT FSE.

[20]  Jens Krinke,et al.  Identifying similar code with program dependence graphs , 2001, Proceedings Eighth Working Conference on Reverse Engineering.

[21]  Christopher Krügel,et al.  Polymorphic Worm Detection Using Structural Information of Executables , 2005, RAID.

[22]  Mario Vento,et al.  An Improved Algorithm for Matching Large Graphs , 2001 .

[23]  Renato De Mori,et al.  Pattern matching for clone and concept detection , 2004, Automated Software Engineering.

[24]  Srinivas Mukkamala,et al.  Malware detection using assembly and API call sequences , 2011, Journal in Computer Virology.

[25]  Eugene W. Myers,et al.  Suffix arrays: a new method for on-line string searches , 1993, SODA '90.

[26]  Brendan D. McKay,et al.  Practical graph isomorphism, II , 2013, J. Symb. Comput..

[27]  Mattia Monga,et al.  Detecting Self-mutating Malware Using Control-Flow Graph Matching , 2006, DIMVA.

[28]  László Babai,et al.  Canonical labeling of graphs , 1983, STOC.

[29]  Kim Henrick,et al.  Common subgraph isomorphism detection by backtracking search , 2004, Softw. Pract. Exp..

[30]  Eldad Eilam,et al.  Reversing: Secrets of Reverse Engineering , 2005 .

[31]  Richard Chbeir,et al.  Efficient XML Structural Similarity Detection using Sub-tree Commonalities , 2007, SBBD.

[32]  Ettore Merlo,et al.  Experiment on the automatic detection of function clones in a software system using metrics , 1996, 1996 Proceedings of International Conference on Software Maintenance.

[33]  Ross J. Anderson,et al.  Rendezvous: A search engine for binary code , 2013, 2013 10th Working Conference on Mining Software Repositories (MSR).

[34]  Mourad Debbabi,et al.  On the Reverse Engineering of the Citadel Botnet , 2014, FPS.

[35]  Philip S. Yu,et al.  GPLAG: detection of software plagiarism by program dependence graph analysis , 2006, KDD '06.

[36]  Zhendong Su,et al.  DECKARD: Scalable and Accurate Tree-Based Detection of Code Clones , 2007, 29th International Conference on Software Engineering (ICSE'07).

[37]  Daniel J. Quinlan,et al.  Detecting code clones in binary executables , 2009, ISSTA.

[38]  Christian Rossow,et al.  Leveraging semantic signatures for bug search in binary programs , 2014, ACSAC.

[39]  Stéphane Ducasse,et al.  A language independent approach for detecting duplicated code , 1999, Proceedings IEEE International Conference on Software Maintenance - 1999 (ICSM'99). 'Software Maintenance for Business Change' (Cat. No.99CB36360).

[40]  Burton H. Bloom,et al.  Space/time trade-offs in hash coding with allowable errors , 1970, CACM.

[41]  John E. Gaffney,et al.  Software Function, Source Lines of Code, and Development Effort Prediction: A Software Science Validation , 1983, IEEE Transactions on Software Engineering.

[42]  Andrian Marcus,et al.  Identification of high-level concept clones in source code , 2001, Proceedings 16th Annual International Conference on Automated Software Engineering (ASE 2001).

[43]  Dan Gusfield Algorithms on Strings, Trees, and Sequences - Computer Science and Computational Biology , 1997 .

[44]  Richard M. Karp,et al.  Combinatorics, complexity, and randomness , 1986, CACM.

[45]  James C. King,et al.  Symbolic execution and program testing , 1976, CACM.

[46]  J. Howard Johnson,et al.  Substring matching for clone detection and change tracking , 1994, Proceedings 1994 International Conference on Software Maintenance.

[47]  Xin-She Yang,et al.  Introduction to Algorithms , 2021, Nature-Inspired Optimization Algorithms.

[48]  H. Kuhn The Hungarian method for the assignment problem , 1955 .

[49]  Mattia Monga,et al.  Using Code Normalization for Fighting Self-Mutating Malware , 2006, ISSSE.

[50]  Yoann Guillot,et al.  Automatic binary deobfuscation , 2009, Journal in Computer Virology.

[51]  Yaniv David,et al.  Tracelet-based code search in executables , 2014, PLDI.

[52]  Zheng Wang,et al.  BMAT - A Binary Matching Tool for Stale Profile Propagation , 2000, J. Instr. Level Parallelism.

[53]  Thomas Dullien,et al.  Graph-based comparison of Executable Objects , 2005 .

[54]  Alexandr Andoni,et al.  Near-Optimal Hashing Algorithms for Approximate Nearest Neighbor in High Dimensions , 2006, 2006 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS'06).

[55]  Susan Horwitz,et al.  Using Slicing to Identify Duplication in Source Code , 2001, SAS.

[56]  Anas N. Al-Rabadi,et al.  A comparison of modified reconstructability analysis and Ashenhurst‐Curtis decomposition of Boolean functions , 2004 .

[57]  Andrew Walenstein,et al.  Malware phylogeny generation using permutations of code , 2005, Journal in Computer Virology.

[58]  Richard M. Karp,et al.  Efficient Randomized Pattern-Matching Algorithms , 1987, IBM J. Res. Dev..