Authorship attribution of source code by using back propagation neural network based on particle swarm optimization

Authorship attribution is the task of identifying the most likely author of a given sample from a set of known candidate authors. It can be applied not only to plain text, such as novels, blogs, emails, and posts, but also to source code, where the goal is to identify the programmer. Authorship attribution of source code is required in diverse applications, ranging from tracking malicious code to resolving authorship disputes and detecting software plagiarism. This paper proposes a new method for identifying the programmer of Java source code samples with higher accuracy. To this end, it introduces, for the first time, a back propagation (BP) neural network based on particle swarm optimization (PSO) into authorship attribution of source code. The method begins by computing a set of defined feature metrics, covering lexical and layout metrics as well as structure and syntax metrics, for a total of 19 dimensions. These metrics are then fed to a neural network for supervised learning, whose weights are produced by a hybrid PSO and BP algorithm. The effectiveness of the proposed method is evaluated on a collected dataset of 3,022 Java files belonging to 40 authors. Experimental results show that the proposed method achieves 91.060% accuracy. A comparison with previous work on authorship attribution of source code for Java shows that the proposed method outperforms the others overall, with acceptable overhead.
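The hybrid training idea can be illustrated with a minimal sketch: PSO performs a global search over the flattened weight vector of a small network, and BP (plain gradient descent here) then refines the best particle. This is not the paper's implementation. The network size, the PSO coefficients, the learning rate, the two-feature toy data, and the binary "author A or not" label are all assumptions chosen for brevity; the paper itself uses 19 metrics and 40 authors.

```python
import math
import random

N_IN, N_HID = 2, 3                      # toy sizes (assumed); the paper uses 19 input metrics
DIM = N_HID * (N_IN + 1) + (N_HID + 1)  # all weights and biases, flattened

def sigmoid(z):
    # Clamp the argument so math.exp cannot overflow.
    return 1.0 / (1.0 + math.exp(-max(-60.0, min(60.0, z))))

def forward(w, x):
    """Return (hidden activations, output) of a one-hidden-layer sigmoid network."""
    hidden = []
    for j in range(N_HID):
        base = j * (N_IN + 1)
        hidden.append(sigmoid(w[base + N_IN] + sum(w[base + i] * x[i] for i in range(N_IN))))
    base = N_HID * (N_IN + 1)
    out = sigmoid(w[base + N_HID] + sum(w[base + j] * hidden[j] for j in range(N_HID)))
    return hidden, out

def loss(w, data):
    """Mean squared error over the data set."""
    return sum((forward(w, x)[1] - y) ** 2 for x, y in data) / len(data)

def gradient(w, data):
    """Back propagation: gradient of the mean squared error for every weight."""
    g = [0.0] * DIM
    out_base = N_HID * (N_IN + 1)
    for x, y in data:
        hidden, out = forward(w, x)
        d_out = 2.0 * (out - y) * out * (1.0 - out) / len(data)
        for j in range(N_HID):
            g[out_base + j] += d_out * hidden[j]
            d_hid = d_out * w[out_base + j] * hidden[j] * (1.0 - hidden[j])
            base = j * (N_IN + 1)
            for i in range(N_IN):
                g[base + i] += d_hid * x[i]
            g[base + N_IN] += d_hid
        g[out_base + N_HID] += d_out
    return g

def pso(data, rng, particles=20, iters=60, inertia=0.7, c1=1.5, c2=1.5):
    """Global search: return the best weight vector found by the swarm."""
    pos = [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(particles)]
    vel = [[0.0] * DIM for _ in range(particles)]
    best = [p[:] for p in pos]
    best_loss = [loss(p, data) for p in pos]
    k0 = min(range(particles), key=best_loss.__getitem__)
    g_pos, g_loss = best[k0][:], best_loss[k0]
    for _ in range(iters):
        for k in range(particles):
            for d in range(DIM):
                vel[k][d] = (inertia * vel[k][d]
                             + c1 * rng.random() * (best[k][d] - pos[k][d])
                             + c2 * rng.random() * (g_pos[d] - pos[k][d]))
                pos[k][d] = max(-5.0, min(5.0, pos[k][d] + vel[k][d]))
            cur = loss(pos[k], data)
            if cur < best_loss[k]:
                best[k], best_loss[k] = pos[k][:], cur
                if cur < g_loss:
                    g_pos, g_loss = pos[k][:], cur
    return g_pos

def backprop(w, data, epochs=500, lr=0.5):
    """Local refinement by gradient descent, keeping the best weights seen."""
    w, best_w, best_l = w[:], w[:], loss(w, data)
    for _ in range(epochs):
        w = [wi - lr * gi for wi, gi in zip(w, gradient(w, data))]
        cur = loss(w, data)
        if cur < best_l:
            best_w, best_l = w[:], cur
    return best_w

rng = random.Random(0)
# Hypothetical toy data: two style metrics per file, label 1 for "author A".
data = [((0.1, 0.2), 0), ((0.2, 0.1), 0), ((0.8, 0.9), 1), ((0.9, 0.7), 1)]
w_pso = pso(data, rng)            # stage 1: PSO picks a starting point
w_final = backprop(w_pso, data)   # stage 2: BP fine-tunes it
print("loss after PSO:", loss(w_pso, data))
print("loss after BP: ", loss(w_final, data))
```

In this sketch PSO only supplies the starting weights and BP does the final fitting; other ways of interleaving the two stages are possible, and the abstract does not specify which one the authors use.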
