LmPa: Improving Decompilation by Synergy of Large Language Model and Program Analysis

Decompilation aims to recover the source-code form of a binary executable. It has many applications in security and software engineering, such as malware analysis, vulnerability detection, and code reuse. A prominent challenge in decompilation is recovering variable names. We propose a novel method that leverages the synergy of large language models (LLMs) and program analysis. Language models encode rich multi-modal knowledge, but their limited input size prevents them from receiving sufficient global context for name recovery. We propose to divide the task into many LLM queries and to use program analysis to correlate and propagate the query results, which in turn improves the performance of the LLM by providing additional contextual information. Our results show that 75% of the recovered names are considered good by users, and that our technique outperforms the state-of-the-art technique by 16.5% in precision and 20.23% in recall.
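
The query-and-propagate loop described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the function and variable names, the `DATAFLOW` edge list standing in for program analysis, and `mock_llm_query`, which replaces a real LLM call with canned answers. The sketch queries each function separately, propagates the returned names along data-flow edges as hints, and re-queries until the names stop changing.

```python
from collections import Counter

# Hypothetical toy program: each decompiled function lists its placeholder variables.
FUNCTIONS = {
    "sub_401000": ["a1", "v2"],
    "sub_401080": ["a1"],
}
# Hypothetical data-flow edges (caller variable -> callee parameter),
# standing in for the results of program analysis.
DATAFLOW = [(("sub_401000", "v2"), ("sub_401080", "a1"))]


def mock_llm_query(func, variables, hints):
    """Stand-in for one LLM query on one function; returns {variable: name}."""
    canned = {"sub_401000": {"a1": "path", "v2": "fd"}}
    answer = dict(canned.get(func, {}))
    for var in variables:
        # With no guess of its own, the mock adopts the most frequent hint.
        if var not in answer and hints.get(var):
            answer[var] = hints[var].most_common(1)[0][0]
    return answer


def recover_names(functions, dataflow, rounds=3):
    """Query each function separately, then propagate names and re-query."""
    names = {f: {} for f in functions}
    for _ in range(rounds):
        # Propagate currently known names across data-flow edges, in both directions.
        hints = {f: {} for f in functions}
        for src, dst in dataflow:
            for (a, av), (b, bv) in ((src, dst), (dst, src)):
                if av in names[a]:
                    hints[b].setdefault(bv, Counter())[names[a][av]] += 1
        # One small query per function, enriched with the propagated hints.
        new_names = {f: mock_llm_query(f, vs, hints[f]) for f, vs in functions.items()}
        if new_names == names:  # fixed point: nothing changed this round
            break
        names = new_names
    return names


if __name__ == "__main__":
    print(recover_names(FUNCTIONS, DATAFLOW))
```

In this toy run the callee's parameter has no name after the first round and receives one only after the caller's result is propagated to it, which is the kind of added context the abstract refers to.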
