CPA: Accurate Cross-Platform Binary Authorship Characterization Using LDA

Binary authorship characterization is the process of identifying stylistic characteristics that relate to the author of anonymous binary code. The aim is to automate the laborious and error-prone reverse engineering task of discovering information about the author(s) of binary code. This paper presents CPA, a novel approach for characterizing the authors of program binaries. Instead of using generic features such as n-grams, CPA proposes a set of new features based on collections of various aspects of author style, including author code traits, code structure characteristics, and author expertise in solving coding tasks. It employs the Latent Dirichlet Allocation (LDA) algorithm to generate author style signatures, which help identify similar author style characteristics in other binaries. We evaluated CPA on large datasets extracted from selected open-source C/C++ projects on GitHub and from Google Code Jam events. It attributed a large number of authors with a high $F_1$ score: around 91% when the number of authors was 1,500. In addition, the false positive rate was low, around 1.5%. When the code was subjected to refactoring or other code transformations, or was compiled with different compilers or compilation settings, accuracy did not drop significantly, demonstrating the robustness of our tool. Finally, for code written by multiple authors, CPA identified the authors with a high $F_1$ score, around 89%.
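For readers unfamiliar with the two metrics reported above, the following is a minimal sketch of how an $F_1$ score and a false positive rate are conventionally computed from confusion counts. The function names and the counts are hypothetical and chosen only for illustration; they are not taken from the CPA evaluation.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall; returns 0.0 when there are no true positives."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def false_positive_rate(fp: int, tn: int) -> float:
    """Fraction of actual negatives that were wrongly flagged as positive."""
    if fp + tn == 0:
        return 0.0
    return fp / (fp + tn)


if __name__ == "__main__":
    # Hypothetical confusion counts for a single author (illustrative only).
    tp, fp, fn, tn = 90, 10, 8, 892
    print(f"F1  = {f1_score(tp, fp, fn):.3f}")
    print(f"FPR = {false_positive_rate(fp, tn):.3f}")
```

In a multi-author setting such per-author values are typically averaged across authors; the abstract does not state which averaging scheme was used.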
