ELISA: ELiciting ISA of Raw Binaries for Fine-Grained Code and Data Separation

Static binary analysis techniques are widely used to reconstruct the behavior and discover vulnerabilities in software when source code is not available. To avoid errors due to mis-interpreting data as machine instructions (or vice-versa), disassemblers and static analysis tools must precisely infer the boundaries between code and data. However, this information is often not readily available. Worse, compilers may embed small chunks of data inside the code section. Most state of the art approaches to separate code and data are rooted on recursive traversal disassembly, with severe limitations when dealing with indirect control instructions. We propose ELISA, a technique to separate code from data and ease the static analysis of executable files. ELISA leverages supervised sequential learning techniques to locate the code section(s) boundaries of header-less binary files, and to predict the instruction boundaries inside the identified code section. As a preliminary step, if the Instruction Set Architecture (ISA) of the binary is unknown, ELISA leverages a logistic regression model to identify the correct ISA from the file content. We provide a comprehensive evaluation on a dataset of executables compiled for different ISAs, and we show that our method is capable to identify code sections with a byte-level accuracy (F1 score) ranging from \(98.13\%\) to over \(99.9\%\) depending on the ISA. Fine-grained separation of code from embedded data on x86, x86-64 and ARM executables is accomplished with an accuracy of over \(99.9\%\).

[1]  John R. Gilbert,et al.  Parallel sparse matrix-vector and matrix-transpose-vector multiplication using compressed sparse blocks , 2009, SPAA '09.

[2]  Christopher Krügel,et al.  Static Disassembly of Obfuscated Binaries , 2004, USENIX Security Symposium.

[3]  Wuu Yang,et al.  Effective code discovery for ARM/Thumb mixed ISA binaries in a static binary translator , 2013, 2013 International Conference on Compilers, Architecture and Synthesis for Embedded Systems (CASES).

[4]  Nikos Karampatziakis Static Analysis of Binary Executables Using Structural SVMs , 2010, NIPS.

[5]  Barton P. Miller,et al.  Learning to Analyze Binary Computer Code , 2008, AAAI.

[6]  Mohammad Hossain Heydari,et al.  Content based file type detection algorithms , 2003, 36th Annual Hawaii International Conference on System Sciences, 2003. Proceedings of the.

[7]  Mark W. Schmidt,et al.  Block-Coordinate Frank-Wolfe Optimization for Structural SVMs , 2012, ICML.

[8]  Dawn Xiaodong Song,et al.  Recognizing Functions in Binaries with Neural Networks , 2015, USENIX Security Symposium.

[9]  David Brumley,et al.  BAP: A Binary Analysis Platform , 2011, CAV.

[10]  Christopher Krügel,et al.  Driller: Augmenting Fuzzing Through Selective Symbolic Execution , 2016, NDSS.

[11]  Zhenkai Liang,et al.  BitBlaze: A New Approach to Computer Security via Binary Analysis , 2008, ICISS.

[12]  Stefano Zanero,et al.  Context-Based File Block Classification , 2012, IFIP Int. Conf. Digital Forensics.

[13]  Ben Taskar,et al.  Max-Margin Markov Networks , 2003, NIPS.

[14]  Giovanni Vigna,et al.  Static Detection of Vulnerabilities in x86 Executables , 2006, 2006 22nd Annual Computer Security Applications Conference (ACSAC'06).

[15]  John Clemens Automatic classification of object code using machine learning , 2015, Digit. Investig..

[16]  Michael J. Eager Introduction to the DWARF Debugging Format , 2007 .

[17]  Sven Behnke,et al.  PyStruct: learning structured prediction in python , 2014, J. Mach. Learn. Res..

[18]  William J. Buchanan,et al.  Approaches to the classification of high entropy file fragments , 2013, Digit. Investig..

[19]  Herbert Bos,et al.  Dowsing for Overflows: A Guided Fuzzer to Find Buffer Boundary Violations , 2013, USENIX Security Symposium.

[20]  Christopher Krügel,et al.  SOK: (State of) The Art of War: Offensive Techniques in Binary Analysis , 2016, 2016 IEEE Symposium on Security and Privacy (SP).

[21]  Saumya K. Debray,et al.  Obfuscation of executable code to improve resistance to static disassembly , 2003, CCS '03.

[22]  Christopher Krügel,et al.  Firmalice - Automatic Detection of Authentication Bypass Vulnerabilities in Binary Firmware , 2015, NDSS.

[23]  Christopher Krügel,et al.  DIFUZE: Interface Aware Fuzzing for Kernel Drivers , 2017, CCS.

[24]  Bhavani M. Thuraisingham,et al.  Differentiating Code from Data in x86 Binaries , 2011, ECML/PKDD.

[25]  Ke Wang,et al.  Fileprints: identifying file types by n-gram analysis , 2005, Proceedings from the Sixth Annual IEEE SMC Information Assurance Workshop.

[26]  David Brumley,et al.  BYTEWEIGHT: Learning to Recognize Functions in Binary Code , 2014, USENIX Security Symposium.

[27]  Andrew McCallum,et al.  Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data , 2001, ICML.

[28]  Xi Chen,et al.  An In-Depth Analysis of Disassembly on Full-Scale x86/x64 Binaries , 2016, USENIX Security Symposium.

[29]  Herbert Bos,et al.  Compiler-Agnostic Function Detection in Binaries , 2017, 2017 IEEE European Symposium on Security and Privacy (EuroS&P).