Stopword Graphs and Authorship Attribution in Text Corpora

In this work we identify interactions of stopwords (noise words) in text corpora as a fundamental feature for author classification. It is convenient to view such interactions as graphs in which nodes are stopwords and the interaction between a pair of stopwords is represented as an edge weight. We define the interaction in terms of the distances between pairs of stopwords in text documents. Given a list of authors, a graph is computed for each author from their undisputed writings. Authorship of a test document is attributed according to the closeness of the graph derived from it to these author graphs. To this end, we define a closeness measure, based on the Kullback-Leibler divergence, for comparing such graphs. We illustrate the accuracy of our approach by applying it to examples drawn from the Gutenberg archives. Our results show that the proposed approach is effective not only in binary author classification but also in multiclass author classification with as many as 10 authors at a time, and that it compares favourably with the state of the art in author identification.
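
The pipeline described above can be illustrated with a minimal Python sketch. The abstract does not give the exact edge-weight formula or the exact form of the closeness measure, so everything specific here is an assumption made for illustration: the tiny stopword list, the window size, the inverse-distance edge weight, the additive smoothing constant, and the function names `stopword_graph`, `kl_divergence`, and `attribute` are all hypothetical and are not taken from the paper.

```python
import math
import re
from collections import defaultdict

# Hypothetical, deliberately tiny stopword list (an assumption, not the paper's).
STOPWORDS = {"the", "of", "and", "a", "in", "to", "is", "it", "that", "was"}


def stopword_graph(text, stopwords=STOPWORDS, window=5):
    """Build a weighted graph whose nodes are stopwords.

    Assumed weighting: each co-occurrence of two different stopwords at
    token distance d <= window adds 1/d to the weight of their edge.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    positions = [(i, t) for i, t in enumerate(tokens) if t in stopwords]
    weights = defaultdict(float)
    for a, (i, u) in enumerate(positions):
        for j, v in positions[a + 1:]:
            d = j - i
            if d > window:
                break  # positions are in order, so later pairs are farther apart
            if u != v:
                weights[tuple(sorted((u, v)))] += 1.0 / d
    return dict(weights)


def kl_divergence(p_graph, q_graph, eps=1e-6):
    """KL divergence between the normalised edge-weight distributions.

    Additive smoothing (eps) is an assumption that keeps the divergence
    finite when an edge appears in only one of the two graphs.
    """
    edges = set(p_graph) | set(q_graph)
    p_total = sum(p_graph.values()) + eps * len(edges)
    q_total = sum(q_graph.values()) + eps * len(edges)
    divergence = 0.0
    for edge in edges:
        p = (p_graph.get(edge, 0.0) + eps) / p_total
        q = (q_graph.get(edge, 0.0) + eps) / q_total
        divergence += p * math.log(p / q)
    return divergence


def attribute(test_text, author_texts):
    """Return the author whose graph is closest to the test document's graph."""
    test_graph = stopword_graph(test_text)
    scores = {
        author: kl_divergence(test_graph, stopword_graph(text))
        for author, text in author_texts.items()
    }
    return min(scores, key=scores.get)
```

A call such as `attribute(test_text, {"Author A": text_a, "Author B": text_b})` returns the key with the smallest divergence. Because the Kullback-Leibler divergence is asymmetric, the direction used here (test graph against author graph) is itself a design choice of this sketch; a symmetrised variant would be an equally plausible reading of the abstract.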
