discovRE: Efficient Cross-Architecture Identification of Bugs in Binary Code

The identification of security-critical vulnerabilities is key to protecting computer systems. Performing this process at the binary level is important because many software projects are closed-source. Even when the source code is available, compilation may create a mismatch between the source code and the binary code executed by the processor, so analyses performed on source code can fail to detect certain bugs and thus potential vulnerabilities. Existing approaches to finding bugs in binary code 1) use dynamic analysis, which is difficult for firmware; 2) handle only a single architecture; or 3) use semantic similarity, which is very slow when analyzing large code bases. In this paper, we present a new approach to efficiently search for similar functions in binary code. We use this method to identify known bugs in binaries as follows: starting with a vulnerable binary function, we identify similar functions in other binaries across different compilers, optimization levels, operating systems, and CPU architectures. The main idea is to compute the similarity between functions based on the structure of their control flow graphs. To keep this costly computation to a minimum, we employ an efficient pre-filter based on numeric features that quickly identifies a small set of candidate functions. This allows us to efficiently search for similar functions in large code bases. We have designed and implemented a prototype of our approach, called discovRE, that supports four instruction set architectures (x86, x64, ARM, MIPS). We show that discovRE is four orders of magnitude faster than the state-of-the-art academic approach for cross-architecture bug search in binaries. We also show that we can identify the Heartbleed and POODLE vulnerabilities in an Android system image that contains over 130,000 native ARM functions in about 80 milliseconds.
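
The two-stage search described above (a cheap numeric pre-filter followed by a structural comparison of control flow graphs) can be illustrated with a minimal Python sketch. This is not the discovRE implementation: the feature set (block and edge counts), the Euclidean ranking, the edge-signature similarity used as a stand-in for the structural comparison, and all function and variable names are assumptions made purely for illustration.

```python
import math
from collections import Counter

# A CFG is modeled (hypothetically) as a dict: basic-block id -> successor ids.

def numeric_features(cfg):
    """Cheap per-function features: (number of blocks, number of edges)."""
    return (len(cfg), sum(len(succs) for succs in cfg.values()))

def prefilter(query, candidates, k):
    """Stage 1: keep the k candidates closest to the query in feature space."""
    qf = numeric_features(query)
    ranked = sorted(candidates,
                    key=lambda name: math.dist(qf, numeric_features(candidates[name])))
    return ranked[:k]

def edge_signature(cfg):
    """Multiset of (out-degree of source, out-degree of target) over all edges."""
    out_deg = {node: len(succs) for node, succs in cfg.items()}
    return Counter((out_deg[u], out_deg.get(v, 0))
                   for u, succs in cfg.items() for v in succs)

def structural_similarity(a, b):
    """Stage 2 stand-in: multiset Jaccard overlap of edge signatures, in [0, 1]."""
    sa, sb = edge_signature(a), edge_signature(b)
    union = sum((sa | sb).values())
    return sum((sa & sb).values()) / union if union else 1.0

def search(query, candidates, k=2):
    """Run the expensive comparison only on the pre-filtered shortlist."""
    shortlist = prefilter(query, candidates, k)
    return max(shortlist, key=lambda name: structural_similarity(query, candidates[name]))

# Toy data: a diamond-shaped query CFG and three hypothetical candidates.
QUERY = {0: [1, 2], 1: [3], 2: [3], 3: []}
CANDIDATES = {
    "diamond_renamed": {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []},
    "chain": {0: [1], 1: [2], 2: [3], 3: []},
    "long_loop": {i: [(i + 1) % 10] for i in range(10)},
}
best_match = search(QUERY, CANDIDATES, k=2)
```

In this toy setup the pre-filter discards `long_loop` on feature distance alone, so the structural comparison runs on two candidates instead of three; the same idea is what lets the costly graph comparison scale to large code bases.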
