Machine Learning-Based Analysis of Program Binaries: A Comprehensive Study

Binary code analysis is crucial to many software engineering tasks, such as malware detection, code refactoring, and plagiarism detection. With the rapid growth of software complexity and the increasing number of heterogeneous computing platforms, binary analysis is more important than ever. Traditional techniques for binary code analysis face multiple challenges, including the need for cross-platform analysis, high scalability and speed, and improved fidelity. To meet these challenges, machine learning-based binary code analysis frameworks have attracted substantial attention because they automate feature extraction and drastically reduce the effort needed to analyze large-scale programs. In this paper, we provide a taxonomy of machine learning-based binary code analysis, describe recent advances and key findings on the topic, and discuss the key challenges and opportunities. Finally, we present our thoughts on future directions for this topic.
