Universality of Stylistic Traits in Texts

The style of documents is an important property that can be used as a discriminant factor in text mining applications. Among the many measures proposed to quantify writing style, some features can be characterized as universal, in the sense that they can be easily extracted from any kind of text in practically any natural language and provide accurate results when used in style-based text categorization tasks. In this paper we examine whether such universal stylometric features remain effective under difficult scenarios where the topic and/or genre of the documents used in the training phase differ from those of the questioned documents. Based on a series of experiments in authorship attribution, we demonstrate that character n-gram features are reliable and effective, provided that an appropriate number of features is used. We also show that when the number of candidate authors increases, the dimensionality of the representation should increase as well to improve classification results.
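To make the notion of a character n-gram representation with a chosen number of features concrete, the following is a minimal sketch and not the authors' implementation. The function names, the choice of n = 3, the frequency-based selection of the top_k n-grams, and the relative-frequency weighting are all assumptions made for illustration; the abstract does not specify any of them.

```python
from collections import Counter


def char_ngrams(text, n=3):
    """Return all overlapping character n-grams of a string."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def build_vocabulary(texts, n=3, top_k=1000):
    """Keep the top_k most frequent character n-grams in the training texts.

    top_k plays the role of the representation dimensionality
    (an assumed selection criterion, chosen here for simplicity).
    """
    counts = Counter()
    for text in texts:
        counts.update(char_ngrams(text, n))
    return [gram for gram, _ in counts.most_common(top_k)]


def vectorize(text, vocabulary, n=3):
    """Represent a text as relative frequencies of the vocabulary n-grams."""
    grams = char_ngrams(text, n)
    counts = Counter(grams)
    total = max(len(grams), 1)  # avoid division by zero on very short texts
    return [counts[gram] / total for gram in vocabulary]


# Hypothetical toy data, for illustration only.
training_texts = ["the cat sat on the mat", "the dog sat on the log"]
vocabulary = build_vocabulary(training_texts, n=3, top_k=5)
features = vectorize("the cat sat on the log", vocabulary, n=3)
```

In this sketch, raising top_k corresponds to increasing the dimensionality of the representation; the resulting vectors could then be passed to any standard classifier.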
