Author identification based on word distribution in word space

Author attribution has become an increasingly challenging research area over the past decade. It is now an essential task in many sectors, such as forensic analysis, law, and journalism, because it helps to identify the author of a document. In this work, unigram/bigram features were combined with latent semantic features from word space, and the similarity of a given document was tested using Random Forest, Logistic Regression, and Support Vector Machine classifiers in order to build a global model. The dataset from the PAN 2014 Author Identification shared task was used for the experiments. The proposed model achieves a state-of-the-art accuracy of 80%, which is significantly higher than the results reported for the PAN 2014 Author Identification task.
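
As a rough illustration of the unigram/bigram similarity idea mentioned above, the following minimal sketch counts word unigrams and bigrams in two texts and compares them with cosine similarity. It is not the authors' implementation: the function names, the whitespace tokenizer, and the toy sentences are assumptions made for illustration, and the latent semantic features and the three classifiers are deliberately left out.

```python
from collections import Counter
from math import sqrt


def ngram_counts(text, n_values=(1, 2)):
    """Count word n-grams (unigrams and bigrams by default) in a text.

    Tokenization is a plain lowercase whitespace split (an assumption;
    the abstract does not specify the preprocessing).
    """
    tokens = text.lower().split()
    counts = Counter()
    for n in n_values:
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def cosine_similarity(a, b):
    """Cosine similarity between two sparse n-gram count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy documents (hypothetical, not taken from the PAN 2014 dataset).
known = "the cat sat on the mat"
questioned = "the cat sat on the sofa"
unrelated = "stock prices fell sharply today"

sim_same = cosine_similarity(ngram_counts(known), ngram_counts(questioned))
sim_diff = cosine_similarity(ngram_counts(known), ngram_counts(unrelated))
print(sim_same, sim_diff)
```

In a fuller pipeline, similarity scores of this kind (together with latent semantic features) would presumably serve as inputs to a classifier that decides whether the questioned document shares an author with the known one.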
