Neural Nets Can Learn Function Type Signatures From Binaries

Function type signatures are important for binary analysis, but they are not available in commercial off-the-shelf (COTS) binaries. In this paper, we present EKLAVYA, a system that trains a recurrent neural network to recover function type signatures from disassembled binary code. EKLAVYA assumes no knowledge of the target instruction set semantics to make such inferences. More importantly, EKLAVYA's results are “explicable”: by analyzing its model, we find that it auto-learns relationships between instructions, compiler conventions, stack frame setup instructions, use-before-write patterns, and operations relevant to identifying types directly from binaries. In our evaluation on Linux binaries compiled with clang and gcc for two architectures (x86 and x64), EKLAVYA achieves accuracy of around 84% and 81% on the function argument count and type recovery tasks, respectively. EKLAVYA generalizes well across the tested compilers, both instruction sets, and various optimization levels, without any specialized prior knowledge of the instruction set, compiler, or optimization level.
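
To make the "use-before-write" pattern mentioned above concrete, the following is a minimal hand-written sketch, not EKLAVYA's learned model. It assumes the System V AMD64 calling convention and a toy, pre-decoded instruction format of `(mnemonic, reads, writes)` tuples. The function name `count_args_use_before_write` and the sample `toy_function` are hypothetical, introduced only for illustration. The sketch handles straight-line code only and ignores sub-registers, floating-point arguments, and stack-passed arguments.

```python
# Integer argument registers of the System V AMD64 calling convention, in order.
ARG_REGS = ["rdi", "rsi", "rdx", "rcx", "r8", "r9"]


def count_args_use_before_write(instructions):
    """Estimate the integer argument count of a straight-line function body.

    `instructions` is a list of (mnemonic, reads, writes) tuples, where
    `reads` and `writes` are sets of register names (a toy, assumed format).
    An argument register that is read before the function writes it must
    have been set by the caller, so it likely carries an argument.
    """
    written = set()
    used_before_write = set()
    for _mnemonic, reads, writes in instructions:
        for reg in reads:
            if reg in ARG_REGS and reg not in written:
                used_before_write.add(reg)
        written.update(writes)

    # Arguments are assigned to registers in order, so the count is the
    # position of the last argument register that was read before written.
    count = 0
    for position, reg in enumerate(ARG_REGS, start=1):
        if reg in used_before_write:
            count = position
    return count


# Hypothetical function body: reads rdi and rsi before writing them,
# and writes rdx before reading it (so rdx is not counted).
toy_function = [
    ("push", {"rbp"}, set()),
    ("mov", {"rsp"}, {"rbp"}),
    ("mov", {"rdi"}, {"rax"}),
    ("add", {"rax", "rsi"}, {"rax"}),
    ("mov", set(), {"rdx"}),
    ("mov", {"rdx"}, {"rcx"}),
    ("pop", set(), {"rbp"}),
    ("ret", set(), set()),
]

print(count_args_use_before_write(toy_function))
```

A rule like this has to be written separately for each instruction set and calling convention. The abstract's claim is that EKLAVYA picks up comparable cues from training data, without being given such rules.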
