Towards Learning Representations of Binary Executable Files for Security Tasks

Binary analysis has traditionally relied on manually defined rules and heuristics. As an alternative, we propose using machine learning models to learn distributed representations of binaries that are applicable to a number of downstream tasks. We construct a computational graph from the binary executable and use it with a graph convolutional neural network to learn a high-dimensional representation of the program. We show the versatility of this approach by using our representations to solve two semantically different binary analysis tasks: algorithm classification and vulnerability discovery. We compare the proposed approach to our own strong baseline as well as to published results, and demonstrate improvement over state-of-the-art methods on both tasks.
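As a rough illustration of how a graph convolutional network can turn a program graph into a fixed-size representation, the sketch below applies the commonly used normalized-adjacency propagation step, ReLU(D^{-1/2} (A + I) D^{-1/2} H W), followed by mean pooling over nodes. This is a minimal sketch under stated assumptions, not the architecture used in the paper: the abstract does not specify the layer type, the node features, or the pooling method, and the toy graph, the feature vectors, and all function names here are hypothetical. Edges are treated as undirected for simplicity.

```python
import math


def normalize_adjacency(adj):
    """Return D^{-1/2} (A + I) D^{-1/2} for a dense adjacency matrix."""
    n = len(adj)
    # Add self-loops so every node keeps its own features (and degree >= 1).
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    deg = [sum(row) for row in a_hat]
    return [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]


def matmul(a, b):
    """Plain dense matrix product for lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def gcn_layer(a_norm, features, weights):
    """One propagation step: ReLU(A_norm @ H @ W)."""
    z = matmul(matmul(a_norm, features), weights)
    return [[max(0.0, v) for v in row] for row in z]


def graph_embedding(adj, features, weight_list):
    """Stack GCN layers, then mean-pool node vectors into one graph vector."""
    a_norm = normalize_adjacency(adj)
    h = features
    for w in weight_list:
        h = gcn_layer(a_norm, h, w)
    n = len(h)
    return [sum(col) / n for col in zip(*h)]


# Hypothetical toy graph: three nodes connected in a path 0 - 1 - 2.
toy_adj = [[0, 1, 0],
           [1, 0, 1],
           [0, 1, 0]]
# Hypothetical 2-dimensional node features (e.g. one-hot node types).
toy_features = [[1.0, 0.0],
                [0.0, 1.0],
                [1.0, 1.0]]
# Identity weights stand in for parameters that would normally be learned.
toy_weights = [[1.0, 0.0],
               [0.0, 1.0]]

embedding = graph_embedding(toy_adj, toy_features, [toy_weights])
print(embedding)
```

In a trained model the weight matrices would be learned from labeled binaries, and the pooled vector would feed a classifier for a downstream task such as algorithm classification or vulnerability discovery.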
