On Leveraging Coding Habits for Effective Binary Authorship Attribution

We propose BinAuthor, the first compiler-agnostic method for identifying the authors of program binaries. BinAuthor first filters out unrelated functions (compiler-generated and library functions) to isolate user-related functions, then converts those functions into a canonical form to eliminate the effects of the compiler and compilation settings. Next, it leverages a set of features based on collections of choices that authors make during coding; these features capture an author’s coding habits. Our evaluation shows that BinAuthor outperforms existing methods in two respects. First, when tested on large datasets extracted from selected open-source C/C++ projects on GitHub, Google Code Jam events, and Planet Source Code contests, it attributed a larger number of authors with significantly higher accuracy: around \(90\%\) when the number of authors is 1000. Second, when the code was subjected to refactoring techniques, code transformation, or processing with different compilers or compilation settings, accuracy did not drop significantly, indicating that BinAuthor is more robust than previous methods.
