Authorship Attribution Using Stylometry and Machine Learning Techniques

Plagiarism is regarded as highly unethical in the academic world. Text alignment is currently the preferred technique for estimating how similar a document is to existing written works. Because it depends on comparison with other documents, it becomes increasingly tedious and time-consuming to scale as the number of online and offline documents grows. This paper therefore studies the use of stylometric features present in a document to verify its authorship. Two machine learning algorithms, k-NN and SMO, were used to predict the authenticity of the writings. A computer program that computes 446 features was implemented. Ten PhD theses, split into segments of 1000, 5000 and 10000 words, were used, giving a corpus of 520 documents. Our results show that authorship attribution using stylometry achieved an accuracy above 90%, except for 7-NN with 1000-word segments. We also show how authorship attribution can be used to identify potential cases of plagiarism in formal writing.
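
The general idea of pairing stylometric features with a k-NN classifier can be sketched as follows. This is a minimal illustration and not the program described above: the twelve features, the function-word list, and the names `stylometric_features` and `knn_predict` are assumptions made for the example, and the paper's 446 features are not enumerated here. SMO, a training algorithm for support vector machines, is not shown.

```python
import math
import re
from collections import Counter

# Hypothetical function-word list, chosen only for illustration.
FUNCTION_WORDS = ("the", "of", "and", "to", "in", "that", "is", "it")


def stylometric_features(text):
    """Map a text segment to a small vector of style features (12 values)."""
    words = re.findall(r"[A-Za-z']+", text.lower())
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    n_words = max(len(words), 1)  # avoid division by zero on empty input
    counts = Counter(words)
    features = [
        sum(len(w) for w in words) / n_words,  # mean word length
        n_words / max(len(sentences), 1),      # mean sentence length in words
        len(counts) / n_words,                 # type-token ratio
        text.count(",") / n_words,             # commas per word
    ]
    # Relative frequency of each function word.
    features.extend(counts[w] / n_words for w in FUNCTION_WORDS)
    return features


def knn_predict(train, query, k=3):
    """Majority vote among the k training vectors nearest to query (Euclidean)."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    # Placeholder segments; a real corpus would use much longer texts.
    labelled = [
        ("The results of the study are clear. The method is simple.", "author_a"),
        ("The aim of the work is stated. The data is shown in the table.", "author_a"),
        ("However, notwithstanding prior claims, we argue, tentatively, otherwise.", "author_b"),
        ("Moreover, given such evidence, one might, cautiously, conclude differently.", "author_b"),
    ]
    train = [(stylometric_features(text), label) for text, label in labelled]
    query = stylometric_features("The goal of the paper is simple. The idea is new.")
    print(knn_predict(train, query, k=3))
```

In practice the features would usually be normalised to a common scale before computing distances, since otherwise features with large ranges (such as sentence length) dominate the Euclidean distance.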
