Codee: A Tensor Embedding Scheme for Binary Code Search

Given a target binary function, binary code search retrieves the top-K most similar functions in a repository, where similar functions are those compiled from the same source code. Searching binary code is particularly challenging because of the wide variation in compiler tool-chains, compiler options, and CPU architectures, and because the repository may hold thousands of binaries. Furthermore, current binary code search schemes suffer from pivotal issues, including inaccurate text-based or token-based analysis, slow graph matching, and complex deep learning processes. In this paper, we present Codee, an unsupervised tensor embedding scheme that carries out code search efficiently and accurately at the binary function level. First, we use an NLP-based neural network to generate semantic-aware token embeddings. Second, we propose an efficient basic block embedding generation algorithm based on a network representation learning model. It learns both the semantic information of instructions and the structural information of the control flow to generate each basic block embedding. We then combine all basic block embeddings in a function to obtain a variable-length function feature vector. Third, we build a tensor and generate function embeddings from it using tensor singular value decomposition, which compresses the variable-length vectors into short fixed-length vectors to facilitate efficient search afterward. We further propose a dynamic tensor compression algorithm to incrementally update the function embedding database. Finally, we use locality-sensitive hashing to find the top-K most similar functions in the repository. Compared with state-of-the-art cross-optimization-level code search schemes, such as Asm2Vec and DeepBinDiff, our scheme achieves higher average search accuracy, shorter feature vectors, and faster feature generation on four datasets: OpenSSL, Coreutils, libgmp, and libcurl. Compared with cross-platform and cross-optimization-level code search schemes, such as Gemini and SAFE, our method also achieves higher average recall.
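
The final retrieval step can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes each function has already been compressed to a short fixed-length embedding, and it swaps in random-hyperplane (cosine) locality-sensitive hashing as one common LSH variant, since the abstract does not say which hash family is used. The class name `LSHIndex`, its parameters, and the function names in the usage example are all hypothetical.

```python
import math
import random


def make_hyperplanes(dim, n_bits, seed):
    """Draw n_bits random Gaussian hyperplanes in a dim-dimensional space."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def signature(vec, planes):
    """Hash a vector to a bit tuple: one bit per hyperplane (side of the plane)."""
    return tuple(
        1 if sum(p * v for p, v in zip(plane, vec)) >= 0 else 0
        for plane in planes
    )


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


class LSHIndex:
    """Hypothetical top-K index over fixed-length function embeddings."""

    def __init__(self, dim, n_bits=8, n_tables=4, seed=0):
        # Each table has its own hyperplanes and its own bucket dictionary.
        self.tables = [
            (make_hyperplanes(dim, n_bits, seed + i), {})
            for i in range(n_tables)
        ]
        self.vectors = {}

    def add(self, name, vec):
        """Store an embedding and register it in every hash table."""
        self.vectors[name] = vec
        for planes, buckets in self.tables:
            buckets.setdefault(signature(vec, planes), []).append(name)

    def query(self, vec, k):
        """Collect bucket collisions, then rank those candidates by cosine."""
        candidates = set()
        for planes, buckets in self.tables:
            candidates.update(buckets.get(signature(vec, planes), []))
        ranked = sorted(
            candidates,
            key=lambda n: cosine(vec, self.vectors[n]),
            reverse=True,
        )
        return ranked[:k]


if __name__ == "__main__":
    # Made-up embeddings standing in for compressed function features.
    index = LSHIndex(dim=4, n_bits=6, n_tables=3, seed=1)
    index.add("func_a", [1.0, 0.0, 2.0, -1.0])
    index.add("func_b", [-1.0, 3.0, 0.5, 0.0])
    print(index.query([1.0, 0.0, 2.0, -1.0], k=1))
```

Only functions that collide with the query in at least one table are ranked, so the cost of a query depends on bucket sizes rather than on the size of the repository. More bits per table make buckets smaller and faster to scan, while more tables recover neighbors that a single table would miss.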
