Learning Program-Wide Code Representations for Binary Diffing

Author(s): Li, Xuezixiang | Advisor(s): Yin, Heng | Abstract: Binary diffing analysis quantitatively measures the differences between two given binaries and produces fine-grained basic block matching. It has been widely used to enable different kinds of critical security analysis. However, existing program-analysis and learning-based techniques suffer from low accuracy, poor scalability, or coarse granularity, especially on COTS binaries that do not contain complete debug information. In addition, some learning-based approaches require extensive labeled training data to function, so a precisely labeled and representative dataset is needed to obtain good results. To overcome these limitations, in this paper we present a novel learning-based approach to code representation generation for solving the binary diffing problem. We rely only on code semantic information and program-wide control flow structural information to generate block embeddings, without requiring any debug information. Furthermore, we propose a K-hop greedy matching algorithm to find the optimal diffing results using the generated block representations. We implement a prototype called DeepBinDiff and evaluate its effectiveness and efficiency on a large number of binaries and real-world vulnerabilities. The results show that our tool can outperform state-of-the-art binary diffing tools by a large margin for both cross-version and cross-optimization-level diffing. A case study on OpenSSL using real-world vulnerabilities further demonstrates the usefulness of our system.
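The abstract names a K-hop greedy matching step over block embeddings but does not spell it out. The following is a minimal illustrative sketch of one way such a step could work, not the DeepBinDiff implementation. Everything in it is an assumption made for illustration: the function names (`k_hop_neighbors`, `cosine`, `k_hop_greedy_match`), the dictionary-based control flow graphs, the use of cosine similarity, the similarity threshold, and the idea that matching starts from a set of already-trusted seed pairs.

```python
import math
from collections import deque


def k_hop_neighbors(adj, start, k):
    """Return the set of nodes reachable from `start` in at most k hops."""
    seen = {start}
    frontier = [start]
    for _ in range(k):
        next_frontier = []
        for node in frontier:
            for succ in adj.get(node, ()):
                if succ not in seen:
                    seen.add(succ)
                    next_frontier.append(succ)
        frontier = next_frontier
    seen.discard(start)
    return seen


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def k_hop_greedy_match(adj1, adj2, emb1, emb2, seeds, k=2, threshold=0.6):
    """Greedily grow a block matching outward from seed pairs.

    adj1/adj2: hypothetical CFGs as {block: [successor blocks]}.
    emb1/emb2: hypothetical {block: embedding vector} maps.
    seeds:     block pairs assumed to be matched already.
    """
    matched = dict(seeds)          # block in binary 1 -> block in binary 2
    used = set(matched.values())   # blocks of binary 2 already taken
    queue = deque(seeds)
    while queue:
        a, b = queue.popleft()
        while True:
            # Candidates: unmatched blocks within k hops of the matched pair.
            cand1 = [n for n in k_hop_neighbors(adj1, a, k) if n not in matched]
            cand2 = [n for n in k_hop_neighbors(adj2, b, k) if n not in used]
            best = max(
                ((cosine(emb1[x], emb2[y]), x, y) for x in cand1 for y in cand2),
                key=lambda t: t[0],
                default=None,
            )
            if best is None or best[0] < threshold:
                break
            _, x, y = best
            matched[x] = y
            used.add(y)
            queue.append((x, y))   # later explore around the new pair too
    return matched
```

Calling `k_hop_greedy_match` with two small graphs, their embeddings, and one seed pair returns a dictionary mapping blocks of the first binary to blocks of the second. Restricting candidates to the k-hop neighborhood of pairs that are already matched is what lets structural context break ties between blocks with similar embeddings. Being greedy, the sketch does not guarantee a globally optimal assignment.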
