Source Code Authorship Analysis For Supporting the Cybercrime Investigation Process

Cybercrime has increased in severity and frequency in recent years, and it has become a major concern for companies, universities and other organizations. The anonymity offered by the Internet makes it difficult to trace the identity of offenders. One field of study that has contributed to tracing criminals is authorship analysis of e-mails, messages and programs. This paper presents a study of source code authorship analysis. The aim of research in this area is to identify the author of a particular piece of code by examining its programming style characteristics. Borrowing extensively from the existing fields of linguistics and software metrics, this field investigates various aspects of computer program authorship. Source code authorship analysis could be applied in cases of cyber attacks, plagiarism and computer fraud. In this paper we present the set of tools and techniques used to achieve the goal of authorship identification, a review of the research efforts in the area, and a new taxonomy of source code authorship analysis.
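As a rough illustration of what examining programming style characteristics can mean in practice, the following minimal sketch computes a few layout metrics for a program and attributes an unknown sample to the known author whose metrics are closest. The four metrics, the function names `style_features` and `closest_author`, and the use of Euclidean distance are assumptions chosen for illustration; they are not the metric set or classifier of any particular study covered by the paper.

```python
from math import sqrt


def style_features(source: str) -> list[float]:
    """Return a small, illustrative vector of layout-style metrics.

    The metrics are hypothetical examples: blank-line ratio,
    '#'-comment-line ratio, tab-indented-line ratio, and mean
    non-blank line length scaled by 80 columns.
    """
    lines = source.splitlines()
    total = len(lines) or 1  # avoid division by zero on empty input
    non_blank = [ln for ln in lines if ln.strip()]
    blank_ratio = (len(lines) - len(non_blank)) / total
    comment_ratio = sum(ln.lstrip().startswith("#") for ln in lines) / total
    tab_indent_ratio = sum(ln.startswith("\t") for ln in lines) / total
    mean_len = sum(len(ln) for ln in non_blank) / (len(non_blank) or 1)
    return [blank_ratio, comment_ratio, tab_indent_ratio, mean_len / 80.0]


def closest_author(sample: str, known: dict[str, str]) -> str:
    """Pick the author whose reference code has the nearest feature vector."""
    target = style_features(sample)

    def distance(author: str) -> float:
        ref = style_features(known[author])
        return sqrt(sum((a - b) ** 2 for a, b in zip(target, ref)))

    return min(known, key=distance)


# Toy data: one author comments and indents with spaces, the other uses tabs.
known_samples = {
    "alice": "# add\ndef f(a, b):\n    return a + b\n\n# sub\ndef g(a, b):\n    return a - b\n",
    "bob": "def f(a,b):\n\treturn a+b\ndef g(a,b):\n\treturn a-b\n",
}
unknown = "def h(x,y):\n\treturn x*y\n"
guess = closest_author(unknown, known_samples)
```

A realistic system would use far more metrics, many samples per author, and a trained classifier; the sketch only shows the overall shape of the task, which is to turn code into style measurements and compare them across candidate authors.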
