A Survey of Binary Code Similarity

Binary code similarity approaches compare two or more pieces of binary code to identify their similarities and differences. The ability to compare binary code enables many real-world applications in scenarios where source code may not be available, such as patch analysis, bug search, and malware detection and analysis. Over the past 20 years, numerous binary code similarity approaches have been proposed, but the research area has not yet been systematically analyzed. This paper presents a first survey of binary code similarity. It analyzes 61 binary code similarity approaches, which are systematized along four aspects: (1) the applications they enable, (2) their approach characteristics, (3) how the approaches are implemented, and (4) the benchmarks and methodologies used to evaluate them. In addition, the survey discusses the scope and origins of the area, its evolution over the past two decades, and the challenges that lie ahead.
