Deep Stylometry and Lexical & Syntactic Features Based Author Attribution on PLoS Digital Repository

In this paper, we address the problem of author attribution through unsupervised clustering using lexical and syntactic features and through a novel deep learning based stylometric model. For this purpose, we download all 158,918 publications accessible up to 1 July 2015 from PLOS.org, an open access digital repository of full-text publications. After pre-processing, we use 803 single-authored publications written by 203 unique authors. For unsupervised modeling, stylometric markers in the form of lexical and syntactic features are used to build a distance matrix, which is clustered with the k-Means algorithm. For supervised modeling, we present a novel long short-term memory (LSTM) based deep learning model and report its testing accuracy on the publications written by each author. Our unsupervised model places 88.17% of authors in the correct cluster (all papers written by the same author) with an entropy error coefficient of at most 0.2, while our deep learning based model consistently shows above 95% accuracy across all the given testing samples of publications written by an author, with an average loss of 0.21.
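To make the unsupervised evaluation concrete, the following is a minimal sketch rather than the authors' implementation. It assumes three simple lexical markers (average word length, type-token ratio, average sentence length) and assumes that the entropy error is a normalised Shannon entropy over the clusters an author's papers are assigned to; the abstract does not give the exact feature set or the entropy definition, and the function names `lexical_features` and `author_cluster_entropy` are hypothetical.

```python
import math
import re
from collections import Counter


def lexical_features(text):
    """Return a few simple stylometric markers for one document (illustrative set)."""
    # Split on sentence-ending punctuation and drop empty fragments.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text.lower())
    if not words or not sentences:
        raise ValueError("text must contain at least one word")
    return {
        "avg_word_length": sum(len(w) for w in words) / len(words),
        "type_token_ratio": len(set(words)) / len(words),
        "avg_sentence_length": len(words) / len(sentences),
    }


def author_cluster_entropy(cluster_labels):
    """Normalised Shannon entropy of one author's cluster assignments (assumed metric).

    0.0 means all of the author's papers fell into a single cluster;
    1.0 means they were spread evenly over the clusters they touch.
    """
    counts = Counter(cluster_labels)
    total = sum(counts.values())
    if len(counts) <= 1:
        return 0.0
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return entropy / math.log2(len(counts))
```

Under this assumed definition, an author whose papers all land in one cluster scores 0.0, so a threshold such as 0.2 would tolerate only a small fraction of stray papers. The feature vectors returned by `lexical_features` could be passed to any k-Means implementation to produce the cluster labels.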
