Using recurrent neural networks for decompilation

Decompilation, the recovery of source code from binary code, is useful in many situations where software must be analyzed or understood but its source code is not available. Source code is much easier for humans to read than binary code, and many tools exist for analyzing it. Existing decompilation techniques, however, often generate source code that is difficult for humans to understand because it rarely uses the idioms that programmers use, and these differences from human-written code also reduce the effectiveness of analysis tools on the decompiled output. To address this gap, we present a novel technique for decompiling binary code snippets using a model based on recurrent neural networks (RNNs). The model learns properties and patterns that occur in source code and uses them to produce decompilation output. We train and evaluate our technique on snippets of binary machine code compiled from C source code. The general approach we outline in this paper is not language-specific and requires little or no domain knowledge of a language, its properties, or the workings of a compiler, making it easily extensible to new languages and constructs. Furthermore, the technique can be applied in situations that traditional decompilers do not target, such as decompiling isolated binary snippets; fast, on-demand decompilation; domain-specific learned decompilation; optimizing decompiled output for readability; and recovering control-flow constructs, comments, and variable or function names. We show that the translations produced by this technique are often accurate or close to correct and can provide a useful picture of a snippet's behavior.
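To make the approach concrete, below is a minimal sketch of an RNN encoder-decoder that maps a sequence of binary-code tokens to a sequence of C source tokens. This is not the authors' implementation: the framework (PyTorch), the class name Seq2SeqDecompiler, and all vocabulary sizes and dimensions are illustrative assumptions, chosen only to show the shape of the technique.

    # Hypothetical sketch of an RNN encoder-decoder for decompilation.
    # Vocabulary sizes, embedding/hidden dimensions, and token handling
    # are illustrative assumptions, not the paper's actual configuration.
    import torch
    import torch.nn as nn

    class Seq2SeqDecompiler(nn.Module):
        def __init__(self, bin_vocab, src_vocab, emb=128, hidden=256):
            super().__init__()
            self.bin_embed = nn.Embedding(bin_vocab, emb)  # binary-token embeddings
            self.src_embed = nn.Embedding(src_vocab, emb)  # source-token embeddings
            self.encoder = nn.LSTM(emb, hidden, batch_first=True)
            self.decoder = nn.LSTM(emb, hidden, batch_first=True)
            self.out = nn.Linear(hidden, src_vocab)        # next-source-token logits

        def forward(self, bin_tokens, src_tokens):
            # Encode the binary snippet; the final LSTM state summarizes it.
            _, state = self.encoder(self.bin_embed(bin_tokens))
            # Decode source tokens conditioned on that state (teacher forcing).
            dec_out, _ = self.decoder(self.src_embed(src_tokens), state)
            return self.out(dec_out)

    # Illustrative training step: a batch of 8 binary snippets of 50 tokens,
    # each paired with 40 source tokens, trained with cross-entropy on the
    # next-token prediction (shifted input/target).
    model = Seq2SeqDecompiler(bin_vocab=512, src_vocab=4096)
    bin_batch = torch.randint(0, 512, (8, 50))
    src_batch = torch.randint(0, 4096, (8, 40))
    logits = model(bin_batch, src_batch[:, :-1])
    loss = nn.functional.cross_entropy(
        logits.reshape(-1, 4096), src_batch[:, 1:].reshape(-1))

At inference time, such a model would generate source tokens one at a time (e.g., greedily or with beam search) rather than using teacher forcing; note how nothing in the sketch depends on C specifically, which reflects the paper's claim that the approach carries over to other languages.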
