Decoupling coding habits from functionality for effective binary authorship attribution

Binary authorship attribution refers to the process of identifying the author of a given anonymous binary file based on stylistic characteristics. It aims to automate the laborious and error-prone reverse engineering task of discovering information about the author(s) of a binary. Existing works typically employ machine learning methods to extract features that are unique to each author and subsequently match them against a given binary to identify the author. However, most existing works share a critical limitation: they cannot distinguish between features representing program functionality and those representing authorship (e.g., authors’ coding habits). This distinction is crucial for effective authorship attribution because what is unique in a particular binary may stem from the author, the compiler, or the program's functionality. In this study, we present BINAUTHOR, a system capable of decoupling program functionality from authors’ coding habits in binary code. To capture coding habits, BINAUTHOR leverages a set of features based on collections of functionality-independent choices made by authors during coding. Our evaluation demonstrates that BINAUTHOR outperforms existing methods in several aspects. First, it attributes a larger number of authors with significantly higher accuracy (around 90%) on large datasets extracted from selected open-source C++ projects on GitHub, Google Code Jam events, Planet Source Code contests, and several programming projects. Second, BINAUTHOR is more robust than previous methods: there is no significant drop in accuracy when the code is refactored, subjected to simple obfuscation, or compiled with different compilers. Finally, decoupling authorship from functionality allows us to apply BINAUTHOR to real malware binaries (Citadel, Zeus, Stuxnet, Flame, Bunny, and Babar) to automatically generate evidence of similar coding habits.
