A Neural-based Program Decompiler

Reverse engineering of binary executables is a critical problem in the computer security domain. On the one hand, malicious parties may recover interpretable source code from software products to gain commercial advantages. On the other hand, binary decompilation can be leveraged for code vulnerability analysis and malware detection. However, efficient binary decompilation is challenging. Conventional decompilers have the following major limitations: (i) they are only applicable to a specific source-target language pair and hence incur undesired development cost for each new language task; (ii) their output high-level code cannot effectively preserve the correct functionality of the input binary; (iii) their output program does not capture the semantics of the input, and the reversed program is hard to interpret. To address these problems, we propose Coda, the first end-to-end neural-based framework for code decompilation. Coda decomposes the decompilation task into two key phases. First, during the code sketch generation stage, Coda employs an instruction type-aware encoder and a tree decoder with attention feeding to generate an abstract syntax tree (AST). Second, Coda updates the code sketch using an iterative error correction machine guided by an ensembled neural error predictor. By finding a good approximate candidate and then refining it toward a perfect reconstruction, Coda achieves superior performance compared to baseline approaches. We assess Coda's performance with extensive experiments on various benchmarks. Evaluation results show that Coda achieves an average of 82% program recovery accuracy on unseen binary samples, whereas state-of-the-art decompilers yield 0% accuracy. Furthermore, Coda outperforms the sequence-to-sequence model with attention by a margin of 70% in program accuracy.
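The two-phase idea (produce an approximate sketch, then repair it step by step) can be illustrated with a minimal control-flow sketch. This is not Coda's implementation: the flat token-list representation, the function names `iterative_correction`, `is_correct`, and `predict_error`, and the toy oracle that peeks at a known target are all assumptions made purely for illustration. In the described system, the sketch would come from the neural encoder-decoder and the fixes from the ensembled neural error predictor.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical representation: a candidate program as a flat list of tokens.
Program = List[str]


def iterative_correction(
    sketch: Program,
    is_correct: Callable[[Program], bool],
    predict_error: Callable[[Program], Optional[Tuple[int, str]]],
    max_steps: int = 10,
) -> Program:
    """Repeatedly apply one predicted fix until the candidate is accepted,
    no further fix is proposed, or the step budget runs out."""
    candidate = list(sketch)  # copy so the original sketch is left untouched
    for _ in range(max_steps):
        if is_correct(candidate):
            break
        fix = predict_error(candidate)
        if fix is None:  # the predictor has nothing more to suggest
            break
        position, token = fix
        candidate[position] = token
    return candidate


# Toy stand-ins (assumptions): the "predictor" simply compares against a known
# target, which a real decompiler would of course not have access to.
target: Program = ["a", "=", "b", "+", "c"]
sketch: Program = ["a", "=", "b", "-", "d"]  # approximate phase-one output


def is_correct(program: Program) -> bool:
    return program == target


def predict_error(program: Program) -> Optional[Tuple[int, str]]:
    for index, (got, want) in enumerate(zip(program, target)):
        if got != want:
            return index, want
    return None


recovered = iterative_correction(sketch, is_correct, predict_error)
partial = iterative_correction(sketch, is_correct, predict_error, max_steps=1)
```

The step budget `max_steps` is a design choice of this sketch: it bounds the repair loop so that an unreliable predictor cannot iterate forever, at the cost of possibly returning a candidate that is still only approximately right.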
