Decomperson: How Humans Decompile and What We Can Learn From It

Human analysts must reverse engineer binary programs as a prerequisite for a number of security tasks, such as vulnerability analysis, malware detection, and firmware re-hosting. Existing studies of human reversers and the processes they follow are limited in size and often use qualitative metrics that require subjective evaluation. In this paper, we reframe the problem of reverse engineering binaries as the problem of perfect decompilation, which is the process of recovering, from a binary program, source code that, when compiled, produces binary code that is identical to the original binary. This gives us a quantitative measure of understanding, and lets us examine the reversing process programmatically. We developed a tool, called DECOMPERSON, that supported a group of reverse engineers during a large-scale security competition designed to collect information about the participants’ reverse engineering process, with the well-defined goal of achieving perfect decompilation. Over 150 people participated, and we collected more than 35,000 code submissions, the largest manual reverse engineering dataset to date. This includes snapshots of over 300 successful perfect decompilation attempts. In this paper, we show how perfect decompilation allows programmatic analysis of such large datasets, providing new insights into the reverse engineering process.
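As a rough illustration of the perfect-decompilation criterion defined above, the check reduces to a byte-for-byte comparison between the recompiled candidate and the original binary. The sketch below is only an analogy, not the DECOMPERSON implementation: it assumes Python's built-in `compile` as a stand-in compiler, and the names `toy_compile` and `is_perfect_decompilation` are hypothetical.

```python
import types


def toy_compile(source: str) -> bytes:
    """Stand-in 'compiler' (an assumption for illustration): return the raw
    bytecode of every function defined at the top level of `source`."""
    module_code = compile(source, "<candidate>", "exec")
    functions = [c for c in module_code.co_consts if isinstance(c, types.CodeType)]
    return b"".join(f.co_code for f in functions)


def is_perfect_decompilation(candidate_source: str, target_binary: bytes) -> bool:
    """A candidate is 'perfect' only if recompiling it reproduces the
    target bytes exactly; anything less is a mismatch."""
    return toy_compile(candidate_source) == target_binary


# The "original binary" an analyst would be given (source normally unknown).
target = toy_compile("def add(a, b):\n    return a + b\n")

# Different identifier names, same generated code: counts as perfect.
renamed = "def add(x, y):\n    return x + y\n"

# Same result for numbers, but the operands are loaded in a different order.
reordered = "def add(a, b):\n    return b + a\n"

print(is_perfect_decompilation(renamed, target))
print(is_perfect_decompilation(reordered, target))
```

The criterion is objective because it needs no human judgment: a candidate that merely behaves the same but compiles to different bytes is rejected, while one that differs only in names the compiler discards is accepted.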
