BinDNN: Resilient Function Matching Using Deep Learning

Determining whether two functions taken from different compiled binaries originate from the same function in the source code has many applications in malware reverse engineering. In particular, it allows an analyst to filter large swaths of code, removing functions that have been previously observed or that originate in shared or trusted libraries. The task is challenging, however, because the instruction stream a compiler emits depends heavily on factors such as optimizations, target platforms, and runtime constraints. In this paper, we seek to advance methods for reliably testing the equivalence of functions found in different executables. By leveraging advances in deep learning and natural language processing, we design and evaluate a novel algorithm, BinDNN, that is resilient to variations in compiler, compiler optimization level, and architecture. We show that BinDNN is effective both in isolation and in conjunction with existing approaches. In the latter case, we boost performance by 109% when combining BinDNN with BinDiff to compare functions across architectures. This result, an improvement of 32% for BinDNN and 185% for BinDiff, demonstrates the utility of employing multiple orthogonal approaches to function matching.
