GenTAL: Generative Denoising Skip-gram Transformer for Unsupervised Binary Code Similarity Detection

Binary code similarity detection plays a critical role in cybersecurity. It reduces the substantial manual effort that reverse engineering requires for malware analysis and vulnerability detection, where the original source code is often unavailable. Most existing solutions rely on manual feature engineering and customized code-matching algorithms, which are inefficient and inaccurate. Recent deep learning-based solutions embed the semantics of binary code into a latent space through supervised contrastive learning. However, a training set cannot cover every form in which the same semantics may appear, so the variance of those forms cannot be fully learned from labeled pairs. In this paper, we propose an unsupervised model that aims to learn the intrinsic representation of assembly code semantics. Specifically, we propose a Transformer-based, auto-encoder-like language model for the low-level assembly code grammar that captures an abstract semantic representation. By coupling a Transformer encoder with a skip-gram-style loss design, the model learns a compact representation that is robust against different compilation options. We conduct experiments on four different block-level code similarity tasks. The results show that our method is more robust than state-of-the-art solutions.
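To make the block-level similarity setting concrete, the sketch below shows how learned embeddings are commonly compared once a model has produced them: a query block is ranked against candidate blocks by cosine similarity. This is an illustrative assumption about the retrieval step, not the paper's implementation; the vectors, block names, and the helper functions `cosine_similarity` and `rank_candidates` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_candidates(query, candidates):
    """Return (name, score) pairs sorted from most to least similar to the query."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in candidates.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical toy embeddings: in practice these would come from the trained
# encoder, e.g. one vector per assembly basic block.
query_o0 = [0.9, 0.1, 0.3]  # a block compiled at one optimization level
candidates = {
    "block_A_O3": [0.8, 0.2, 0.35],  # assumed to be the same block at another level
    "block_B_O3": [-0.1, 0.9, 0.2],
    "block_C_O3": [0.1, -0.4, 0.9],
}

ranking = rank_candidates(query_o0, candidates)
best_match = ranking[0][0]
```

A representation that is robust to compilation options should place the semantically equivalent block at the top of such a ranking, regardless of which compiler settings produced the query and the candidates.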
