Binary Analysis Overview

When source code is unavailable, security applications such as malware detection, software license infringement detection, vulnerability analysis, and digital forensics need to extract meaningful fingerprints from binary code efficiently. Such fingerprints make reverse engineering tasks more effective and efficient because they provide a range of insights into program binaries. However, a great deal of important information is likely to be lost during compilation, including variable and function names, the original control and data flow structures, comments, and layout. In this chapter, we provide a comprehensive review of existing binary code fingerprinting frameworks. We systematize the study of binary code fingerprints along its most important dimensions: the applications that motivate it, the approaches used and their implementations, the specific aspects of the fingerprinting framework, and how the results are evaluated.
