A Transformer-based Function Symbol Name Inference Model from an Assembly Language for Binary Reversing

Reverse engineering a stripped binary has a wide range of applications, yet it is challenging, mainly because the binary lacks contextually useful information. Once debugging symbols (e.g., variable names, types, function names) are discarded, recovering such information is not technically viable with traditional approaches such as static or dynamic binary analysis. We focus on function symbol name recovery, which allows a reverse engineer to gain a quick overview of an unseen binary. The key insight is that a well-developed program gives each function a meaningful name that describes its underlying semantics well. In this paper, we present AsmDepictor, a Transformer-based framework that generates a function symbol name from a sequence of assembly code (i.e., machine instructions). It consists of three major components: binary code refinement, model training, and inference. We conduct systematic experiments on the effectiveness of code refinement, which can improve overall performance. We introduce a per-layer positional embedding and Unique-softmax for AsmDepictor, both of which help capture better relationships between tokens. Lastly, we devise a novel evaluation metric tailored to short descriptions, the Jaccard* score. Our empirical evaluation shows that AsmDepictor surpasses state-of-the-art models by up to around 400%. The best AsmDepictor model achieves an F1 of 71.5 and a Jaccard* of 75.4.
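To make the two reported metrics more concrete, the following is a minimal sketch of how a predicted function name can be scored against a reference name at the token level. It is an illustration under stated assumptions, not the paper's evaluation code: the abstract does not define Jaccard*, so the sketch uses the plain Jaccard index, and the helper names (`tokenize`, `jaccard`, `f1`) and the snake_case splitting rule are hypothetical.

```python
def tokenize(name):
    """Split a snake_case function name into a set of lowercase tokens.

    Assumption: names are underscore-separated; the paper's actual
    tokenization is not described in the abstract.
    """
    return {tok for tok in name.lower().split("_") if tok}


def jaccard(predicted, reference):
    """Plain Jaccard index over name tokens (not the paper's Jaccard*)."""
    p, r = tokenize(predicted), tokenize(reference)
    if not p and not r:
        return 1.0  # two empty names are treated as identical
    return len(p & r) / len(p | r)


def f1(predicted, reference):
    """Token-level F1: harmonic mean of precision and recall."""
    p, r = tokenize(predicted), tokenize(reference)
    overlap = len(p & r)
    if overlap == 0:
        return 0.0
    precision = overlap / len(p)
    recall = overlap / len(r)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Hypothetical prediction/reference pair, for illustration only.
    print(jaccard("read_file_buffer", "read_buffer"))
    print(f1("read_file_buffer", "read_buffer"))
```

For the hypothetical pair above, two of the three distinct tokens are shared, so the plain Jaccard index is 2/3, while precision 2/3 and recall 1 give an F1 of 0.8. Both scores reward partial overlap, which matters when the target is only a few words long.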
