FuncNet: A Euclidean Embedding Approach for Lightweight Cross-platform Binary Recognition

Reverse engineering is necessary for understanding how new malware works, but it depends heavily on manual effort. Cross-platform binary recognition eases the work of reverse engineers by identifying duplicated or already-known code that has been compiled for different platforms. However, existing approaches mainly rely on raw function bytes or cosine embedding representations, which suffer from either low recognition accuracy or high search overhead on real-world binary recognition tasks. In this paper, we propose a lightweight neural network-based approach that generates a Euclidean embedding (i.e., a numeric vector) for each binary function from its control flow graph and the interface information of its callees, and classifies the embedding vectors with a Euclidean distance-sensitive artificial neural network. We implement a prototype called FuncNet and evaluate it on real-world projects comprising 1980 binaries and about 2 million function pairs. The experimental results show that its accuracy outperforms state-of-the-art solutions by over 13% on average, and that binary search on large datasets can be done with constant time complexity.
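
As a minimal sketch of what comparing Euclidean embeddings means, the Python snippet below measures the L2 distance between two function embeddings and treats close vectors as the same function. The vectors, the helper names `euclidean_distance` and `same_function`, and the threshold of 0.5 are all assumptions made for illustration; they are not taken from FuncNet, which classifies embeddings with a neural network rather than a fixed threshold.

```python
import math

def euclidean_distance(a, b):
    """Return the L2 distance between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def same_function(a, b, threshold=0.5):
    """Hypothetical decision rule: a distance below `threshold` counts as a match."""
    return euclidean_distance(a, b) < threshold

# Made-up 4-dimensional embeddings, for illustration only.
emb_x86 = [0.10, 0.80, 0.30, 0.50]    # a function compiled for x86
emb_arm = [0.12, 0.79, 0.31, 0.48]    # the same function compiled for ARM
emb_other = [0.90, 0.10, 0.70, 0.05]  # an unrelated function

print(same_function(emb_x86, emb_arm))    # nearby vectors
print(same_function(emb_x86, emb_other))  # distant vectors
```

In this toy setting, the two builds of the same function lie close together and the unrelated function lies far away, which is the property an embedding trained for Euclidean distance is meant to provide.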
