Neural reverse engineering of stripped binaries using augmented control flow graphs

We address the problem of reverse engineering stripped executables, which contain no debug information. The problem is challenging because stripped executables carry little syntactic information and because compiler optimizations produce diverse assembly code patterns. We present a novel approach for predicting procedure names in stripped executables that combines static analysis with neural models. The main idea is to use static analysis to obtain augmented representations of call sites, encode the structure of these call sites using the control-flow graph (CFG), and finally generate a target name while attending to these call sites. We use this representation to drive graph-based, LSTM-based, and Transformer-based architectures. Our evaluation shows that our models produce predictions that are difficult and time-consuming for humans to make, while improving on existing methods by 28% and on state-of-the-art neural textual models that do not use any static analysis by 100%. Code and data for this evaluation are available at https://github.com/tech-srl/Nero.
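
To make the idea of encoding call sites along the CFG concrete, here is a minimal sketch. It is not the authors' implementation: the toy CFG layout, the block names, and the helper `call_site_paths` are hypothetical, and a call site is reduced here to a bare callee name rather than an augmented representation. It only shows how call sites can be ordered along acyclic control-flow paths, giving sequences that a name-generating model could attend to.

```python
from typing import Dict, FrozenSet, List, Tuple

# Hypothetical toy CFG: block id -> (call sites in the block, successor block ids).
CFG = Dict[str, Tuple[List[str], List[str]]]


def call_site_paths(cfg: CFG, entry: str) -> List[List[str]]:
    """Collect call-site sequences along acyclic CFG paths starting at `entry`."""
    paths: List[List[str]] = []

    def walk(block: str, visited: FrozenSet[str], calls: List[str]) -> None:
        block_calls, successors = cfg[block]
        calls = calls + block_calls          # call sites seen so far on this path
        visited = visited | {block}          # blocks already on this path
        unvisited = [s for s in successors if s not in visited]
        if not unvisited:                    # exit block, or only back edges remain
            paths.append(calls)
            return
        for successor in unvisited:
            walk(successor, visited, calls)

    walk(entry, frozenset(), [])
    return paths


# Assumed example: a procedure that opens a socket, then either connects or reports an error.
EXAMPLE_CFG: CFG = {
    "entry": (["socket"], ["check"]),
    "check": ([], ["ok", "fail"]),
    "ok": (["connect"], ["exit"]),
    "fail": (["perror"], ["exit"]),
    "exit": (["close"], []),
}

if __name__ == "__main__":
    for path in call_site_paths(EXAMPLE_CFG, "entry"):
        print(" -> ".join(path))
```

In this toy setting, each printed sequence is one control-flow-ordered view of the procedure's call sites. A real system would also need to bound the number of paths, since path enumeration grows exponentially with branching.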