BinComp: A stratified approach to compiler provenance Attribution

Compiler provenance encompasses numerous pieces of information, such as the compiler family, compiler version, optimization level, and compiler-related functions. The extraction of such information is imperative for various binary analysis applications, such as function fingerprinting, clone detection, and authorship attribution. It is thus important to develop an efficient and automated approach for extracting compiler provenance. In this study, we present BinComp, a practical approach which, analyzes the syntax, structure, and semantics of disassembled functions to extract compiler provenance. BinComp has a stratified architecture with three layers. The first layer applies a supervised compilation process to a set of known programs to model the default code transformation of compilers. The second layer employs an intersection process that disassembles functions across compiled binaries to extract statistical features (e.g., numerical values) from common compiler/linker-inserted functions. This layer labels the compiler-related functions. The third layer extracts semantic features from the labeled compiler-related functions to identify the compiler version and the optimization level. Our experimental results demonstrate that BinComp is efficient in terms of both computational resources and time.

[1]  Barton P. Miller,et al.  Extracting compiler provenance from program binaries , 2010, PASTE '10.

[2]  Mourad Debbabi,et al.  RESource: A Framework for Online Matching of Assembly with Open Source Code , 2012, FPS.

[3]  Charles Elkan,et al.  Scalability for clustering algorithms revisited , 2000, SKDD.

[4]  Konrad Rieck,et al.  Structural detection of android malware using embedded call graphs , 2013, AISec.

[5]  Arun Lakhotia,et al.  Identifying Shared Software Components to Support Malware Forensics , 2014, DIMVA.

[6]  Hisashi Kashima,et al.  A Linear-Time Graph Kernel , 2009, 2009 Ninth IEEE International Conference on Data Mining.

[7]  Barton P. Miller,et al.  Recovering the toolchain provenance of binary code , 2011, ISSTA '11.

[8]  Zaharije Radivojevic,et al.  Approach for estimating similarity between procedures in differently compiled binaries , 2015, Inf. Softw. Technol..

[9]  Lingyu Wang,et al.  SIGMA: A Semantic Integrated Graph Matching Approach for identifying reused functions in binary code , 2015, Digit. Investig..

[10]  Barton P. Miller,et al.  Who Wrote This Code? Identifying the Authors of Program Binaries , 2011, ESORICS.

[11]  Mark Stamp,et al.  Chi-squared distance and metamorphic virus detection , 2013, Journal of Computer Virology and Hacking Techniques.

[12]  Benjamin C. M. Fung,et al.  BinClone: Detecting Code Clones in Malware , 2014, 2014 Eighth International Conference on Software Security and Reliability.

[13]  Thomas W. Reps,et al.  WYSINWYX: What you see is not what you eXecute , 2005, TOPL.

[14]  Stefano Zanero,et al.  Lines of malicious code: insights into the malicious software industry , 2012, ACSAC '12.

[15]  Lingyu Wang,et al.  OBA2: An Onion approach to Binary code Authorship Attribution , 2014, Digit. Investig..

[16]  Barton P. Miller,et al.  Labeling library functions in stripped binaries , 2011, PASTE '11.

[17]  Björn Franke,et al.  Exploiting function similarity for code size reduction , 2014, LCTES '14.