Authorship identification from unstructured texts

Authorship identification is a task of identifying authors of anonymous texts given examples of the writing of authors. The increasingly large volumes of anonymous texts on the Internet enhance the great yet urgent necessity for authorship identification. It has been applied to more and more practical applications including literary works, intelligence, criminal law, civil law, and computer forensics. In this paper, we propose a semantic association model about voice, word dependency relations, and non-subject stylistic words to represent the writing style of unstructured texts of various authors, design an unsupervised approach to extract stylistic features, and employ principal components analysis and linear discriminant analysis to identify authorship of texts. This paper provides a uniform quantified method to capture syntactic and semantic stylistic characteristics of and between words and phrases, and this approach can solve the problem of the independence of different dimensions to some extent. Experimental results on two English text corpora show that our approach significantly improves the overall performance over authorship identification.

[1]  David G. Stork,et al.  Pattern classification, 2nd Edition , 2000 .

[2]  R. Fisher THE USE OF MULTIPLE MEASUREMENTS IN TAXONOMIC PROBLEMS , 1936 .

[3]  Dale Schuurmans,et al.  Language independent authorship attribution using character level language models , 2003, Proceedings of the tenth conference on European chapter of the Association for Computational Linguistics - EACL '03.

[4]  Patrick Juola,et al.  Authorship Attribution , 2008, Found. Trends Inf. Retr..

[5]  Xue-wen Chen,et al.  Combating the Small Sample Class Imbalance Problem Using Feature Selection , 2010, IEEE Transactions on Knowledge and Data Engineering.

[6]  Efstathios Stamatatos,et al.  Author identification: Using text sampling to handle the class imbalance problem , 2008, Inf. Process. Manag..

[7]  Shlomo Argamon,et al.  Author Identification on the Large Scale , 2005 .

[8]  Qi Tian,et al.  Integrating Discriminant and Descriptive Information for Dimension Reduction and Classification , 2007, IEEE Transactions on Circuits and Systems for Video Technology.

[9]  Yiming Yang,et al.  RCV1: A New Benchmark Collection for Text Categorization Research , 2004, J. Mach. Learn. Res..

[10]  Hans Van Halteren,et al.  Author verification by linguistic profiling: An exploration of the parameter space , 2007, TSLP.

[11]  Efstathios Stamatatos,et al.  Authorship Attribution Based on Feature Set Subspacing Ensembles , 2006, Int. J. Artif. Intell. Tools.

[12]  John W. Sammon,et al.  An Optimal Set of Discriminant Vectors , 1975, IEEE Transactions on Computers.

[13]  David G. Stork,et al.  Pattern Classification , 1973 .

[14]  Efstathios Stamatatos,et al.  A survey of modern authorship attribution methods , 2009, J. Assoc. Inf. Sci. Technol..

[15]  Benjamin C. M. Fung,et al.  A unified data mining solution for authorship analysis in anonymous textual communications , 2013, Inf. Sci..

[16]  Walter Daelemans,et al.  Shallow Text Analysis and Machine Learning for Authorship Attribtion , 2005, CLIN.

[17]  Jack Grieve,et al.  Quantitative Authorship Attribution: An Evaluation of Techniques , 2007, Lit. Linguistic Comput..

[18]  Hans van Halteren,et al.  New Machine Learning Methods Demonstrate the Existence of a Human Stylome , 2005, J. Quant. Linguistics.

[19]  Stefanos Gritzalis,et al.  Examining the significance of high-level programming features in source code author classification , 2008, J. Syst. Softw..

[20]  Michael Gamon,et al.  Linguistic correlates of style: authorship classification with deep linguistic analysis features , 2004, COLING.

[21]  Jun Huan,et al.  Knowledge Transfer with Low-Quality Data: A Feature Extraction Issue , 2011, IEEE Transactions on Knowledge and Data Engineering.

[22]  D Jeulin,et al.  Texture classification by statistical learning from morphological image processing: application to metallic surfaces , 2010, Journal of microscopy.

[23]  Ian T. Jolliffe,et al.  Principal Component Analysis , 2002, International Encyclopedia of Statistical Science.

[24]  Ja-Chen Lin,et al.  A new LDA-based face recognition system which can solve the small sample size problem , 1998, Pattern Recognit..

[25]  Fuchun Peng,et al.  N-GRAM-BASED AUTHOR PROFILES FOR AUTHORSHIP ATTRIBUTION , 2003 .

[26]  Sinisa Todorovic,et al.  Local-Learning-Based Feature Selection for High-Dimensional Data Analysis , 2010, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[27]  P ? ? ? ? ? ? ? % ? ? ? ? , 1991 .

[28]  E. Stamatatos Ensemble-based Author Identification Using Character N-grams , 2006 .

[29]  Danielle S. McNamara,et al.  Analyzing Writing Styles with Coh-Metrix , 2006, FLAIRS.

[30]  Liviu P. Dinu,et al.  Authorship Identification of Romanian Texts with Controversial Paternity , 2008, LREC.

[31]  Jussi Karlgren,et al.  Recognizing Text Genres With Simple Metrics Using Discriminant Analysis , 1994, COLING.

[32]  Hsinchun Chen,et al.  A framework for authorship identification of online messages: Writing-style features and classification techniques , 2006 .

[33]  José Nilo G. Binongo,et al.  Who Wrote the 15th Book of Oz? An Application of Multivariate Analysis to Authorship Attribution , 2003 .

[34]  Vladimir N. Vapnik,et al.  The Nature of Statistical Learning Theory , 2000, Statistics for Engineering and Information Science.

[35]  Chih-Jen Lin,et al.  LIBSVM: A library for support vector machines , 2011, TIST.

[36]  Carole E. Chaski,et al.  Who's At The Keyboard? Authorship Attribution in Digital Evidence Investigations , 2005, Int. J. Digit. EVid..

[37]  John C. Platt,et al.  Fast training of support vector machines using sequential minimal optimization, advances in kernel methods , 1999 .

[38]  J. F. Burrows,et al.  Not Unles You Ask Nicely: The Interpretative Nexus Between Analysis and Information , 1992 .

[39]  L. Duchene,et al.  An Optimal Transformation for Discriminant and Principal Component Analysis , 1988, IEEE Trans. Pattern Anal. Mach. Intell..

[40]  Stella Markantonatou,et al.  Discriminating the Registers and Styles in the Modern Greek Language-Part 2: Extending the Feature Vector to Optimize Author Discrimination , 2004, Lit. Linguistic Comput..

[41]  Tsau Young Lin,et al.  An Automata Based Authorship Identification System , 2009, PAKDD Workshops.

[42]  Shlomo Argamon,et al.  Authorship attribution with thousands of candidate authors , 2006, SIGIR.

[43]  Gavin Brown,et al.  Conditional Likelihood Maximisation: A Unifying Framework for Mutual Information Feature Selection , 2012 .

[44]  Shlomo Argamon,et al.  Stylistic text classification using functional lexical features , 2007, J. Assoc. Inf. Sci. Technol..

[45]  David L. Hoover,et al.  Another Perspective on Vocabulary Richness , 2003, Comput. Humanit..

[46]  Christopher D. Manning,et al.  Generating Typed Dependency Parses from Phrase Structure Parses , 2006, LREC.

[47]  D. Holmes,et al.  The Federalist Revisited: New Directions in Authorship Attribution , 1995 .

[48]  Efstathios Stamatatos,et al.  N-Gram Feature Selection for Authorship Identification , 2006, AIMSA.

[49]  Jiawei Han,et al.  Training Linear Discriminant Analysis in Linear Time , 2008, 2008 IEEE 24th International Conference on Data Engineering.

[50]  Tufan Taş,et al.  Author Identification for Turkish Texts , 2007 .

[51]  Moshe Koppel,et al.  Measuring Differentiability: Unmasking Pseudonymous Authors , 2007, J. Mach. Learn. Res..