Semantics-Recovering Decompilation through Neural Machine Translation

Decompilation transforms low-level programming languages (PLs) (e.g., binary code) into high-level PLs (e.g., C/C++). It is widely used when analysts perform security analysis on software whose source code is unavailable, such as vulnerability discovery and malware analysis. However, current decompilation tools rely on manually crafted rules that take experts years of effort to develop and that require long-term maintenance as the syntax of the high-level or low-level PL evolves. Moreover, an ideal decompiler should generate concise high-level PL that preserves the functionality of the source low-level PL and recovers semantic information (e.g., meaningful variable names), just like human-written code. Unfortunately, existing rule-based decompilation techniques only restore the low-level PL to a functionally similar high-level PL and remain powerless to recover semantic information. In this paper, we propose a novel neural decompilation approach that translates low-level PL into accurate and user-friendly high-level PL, effectively improving its readability and understandability. We implement the proposed approach in a prototype called SEAM. Evaluations on four real-world applications show that SEAM achieves an average accuracy of 94.41%, significantly better than prior neural machine translation (NMT) models. Finally, we evaluate the effectiveness of semantic information recovery through a questionnaire survey; the average accuracy is 92.64%, which is comparable or superior to state-of-the-art decompilers.
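To make the framing concrete, neural decompilation treats decompilation as machine translation between token sequences. The sketch below is not the SEAM architecture described in the paper; it is a minimal, hypothetical encoder-decoder set-up, assuming tokenized assembly as the source language, tokenized C as the target language, and placeholder vocabulary sizes.

```python
import torch
import torch.nn as nn

class NeuralDecompiler(nn.Module):
    """Illustrative seq2seq model: tokenized assembly in, tokenized C out (not SEAM itself)."""
    def __init__(self, src_vocab, tgt_vocab, d_model=256, nhead=4, layers=3):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, d_model)   # low-level PL token embeddings
        self.tgt_emb = nn.Embedding(tgt_vocab, d_model)   # high-level PL token embeddings
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=nhead,
            num_encoder_layers=layers, num_decoder_layers=layers,
            batch_first=True)
        self.out = nn.Linear(d_model, tgt_vocab)

    def forward(self, src_ids, tgt_ids):
        # Causal mask so the decoder only attends to already-generated C tokens.
        tgt_mask = self.transformer.generate_square_subsequent_mask(tgt_ids.size(1))
        hidden = self.transformer(self.src_emb(src_ids), self.tgt_emb(tgt_ids),
                                  tgt_mask=tgt_mask)
        return self.out(hidden)  # per-position logits over the C token vocabulary

# Toy usage with hypothetical vocabulary sizes and random token ids.
model = NeuralDecompiler(src_vocab=5000, tgt_vocab=4000)
asm_tokens = torch.randint(0, 5000, (2, 32))   # batch of tokenized assembly
c_tokens = torch.randint(0, 4000, (2, 24))     # shifted C targets (teacher forcing)
logits = model(asm_tokens, c_tokens)
print(logits.shape)                            # torch.Size([2, 24, 4000])
```

Training such a model with cross-entropy over the target C tokens recovers functionality and naming jointly, which is the property rule-based decompilers lack.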
