Scripting DNA: Identifying the JavaScript programmer

Authorship attribution is needed in diverse applications, ranging from historical texts (Shakespeare's works, the Federalist Papers), studied for historical interest, to recent novels, studied for linguistic research or simply out of curiosity (Robert Galbraith, alias J.K. Rowling). Extensive research on this problem has produced effective general-purpose methods. The original author also needs to be identified for other types of text. In particular, we are interested in methods for identifying JavaScript programmers, which can be used to reveal who produced malicious software on a website. This problem has hardly been studied, and so far mainly general-purpose methods from natural-language authorship attribution have been applied to it. Moreover, no suitable reference dataset is available for evaluating and developing methods in a supervised machine learning setting. In this work, we first obtain a reference dataset of substantial size and quality. We then propose extracting structural features from the Abstract Syntax Tree (AST) to describe an author's coding style. Our experiments show that these specifically designed features improve the attribution of scripting code to programmers, especially when combined with character n-gram features.
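To make the two feature families concrete, the following is a minimal sketch, not the feature set used in this work, which the abstract does not specify. It assumes the JavaScript source has already been parsed into an ESTree-style AST (nested dictionaries with a "type" key, as produced by common JavaScript parsers). The function names `char_ngrams` and `ast_features`, the hand-written `TOY_AST`, and the choice of node-type counts, parent>child type pairs, and tree depth as structural features are all illustrative assumptions.

```python
from collections import Counter
from typing import Tuple


def char_ngrams(source: str, n: int = 3) -> Counter:
    """Count overlapping character n-grams in a source string."""
    return Counter(source[i:i + n] for i in range(len(source) - n + 1))


def ast_features(tree: dict) -> Tuple[Counter, int]:
    """Walk an ESTree-style dict and return (counts, max_depth).

    counts holds node-type frequencies and parent>child type-pair
    frequencies; max_depth is the depth of the deepest node (root = 1).
    These features are illustrative assumptions, not the paper's.
    """
    counts: Counter = Counter()
    max_depth = 0
    stack = [(tree, None, 1)]  # (node, parent type, depth)
    while stack:
        node, parent_type, depth = stack.pop()
        node_type = node["type"]
        counts[node_type] += 1
        if parent_type is not None:
            counts[f"{parent_type}>{node_type}"] += 1
        max_depth = max(max_depth, depth)
        # Children are dicts with a "type" key, directly or inside lists.
        for value in node.values():
            children = value if isinstance(value, list) else [value]
            for child in children:
                if isinstance(child, dict) and "type" in child:
                    stack.append((child, node_type, depth + 1))
    return counts, max_depth


# Hand-written ESTree-style tree for the snippet `var x = 1;` (assumed shape).
TOY_SOURCE = "var x = 1;"
TOY_AST = {
    "type": "Program",
    "body": [{
        "type": "VariableDeclaration",
        "kind": "var",
        "declarations": [{
            "type": "VariableDeclarator",
            "id": {"type": "Identifier", "name": "x"},
            "init": {"type": "Literal", "value": 1},
        }],
    }],
}

if __name__ == "__main__":
    print(char_ngrams(TOY_SOURCE).most_common(3))
    print(ast_features(TOY_AST))
```

In a supervised setting, the two counters could be merged into one feature vector per code sample before training a classifier. Character n-grams capture layout and naming habits, while the tree-based counts are unaffected by whitespace and formatting.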
