Neural Reverse Engineering of Stripped Binaries

We address the problem of predicting procedure names in stripped executables that contain no debug information. Predicting procedure names can dramatically ease the task of reverse engineering, saving precious time and human effort. We present a novel approach that combines static analysis of binaries with encoder-decoder neural networks. The main idea is to use static analysis to obtain enriched representations of API call sites; encode a set of sequences of these call sites; and finally, attend to the encoded sequences while decoding the target name token by token. We evaluate our model by predicting procedure names over $60,000$ procedures in $10,000$ stripped executables. Our model achieves $81.70$ precision and $80.12$ recall in predicting procedure names within GNU packages, and $55.48$ precision and $51.31$ recall on a diverse, cross-package dataset. Compared to previous approaches, the predictions made by our model are much more accurate and informative.
