Ngram and Bayesian Classification of Documents for Topic and Authorship

Large, real-world data sets are investigated in the context of authorship attribution for real-world documents. Ngram measures can accurately assign authorship for long documents such as novels. Several 5 x 5 arrays of movie reviews (five authors by five movies) were collected from the Internet Movie Database. Both ngram and naive Bayes classifiers were used to classify the reviews along the authorship axis and the topic (movie) axis. The two approaches yielded similar results, and authorship was detected at least as accurately as topic. Part-of-speech tagging and function-word lists were then used to remove meaning from the documents while leaving grammatical structure intact, in order to investigate the influence of structure on the classification tasks.
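As an illustration only, the following sketch shows one way a naive Bayes classifier over ngram counts could label the same set of reviews along either axis. It is not the implementation used in the study: the choice of character trigrams, the add-one smoothing, the names `char_ngrams` and `NaiveBayes`, and the four toy reviews are all assumptions made for the example.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n=3):
    """Return the overlapping character n-grams of the lowercased text."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


class NaiveBayes:
    """Multinomial naive Bayes over n-gram counts with add-one smoothing."""

    def __init__(self, n=3):
        self.n = n
        self.counts = defaultdict(Counter)  # label -> n-gram counts
        self.docs = Counter()               # label -> number of training docs
        self.vocab = set()                  # every n-gram seen in training

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            grams = char_ngrams(text, self.n)
            self.counts[label].update(grams)
            self.docs[label] += 1
            self.vocab.update(grams)
        return self

    def predict(self, text):
        grams = char_ngrams(text, self.n)
        total_docs = sum(self.docs.values())
        vocab_size = len(self.vocab)
        best_label, best_score = None, -math.inf
        for label in self.docs:
            total = sum(self.counts[label].values())
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(self.docs[label] / total_docs)
            for gram in grams:
                count = self.counts[label][gram]
                score += math.log((count + 1) / (total + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented toy data: (author, movie, review text). Not from the study.
reviews = [
    ("alice", "alien", "honestly, i adored the alien ship; honestly superb."),
    ("alice", "heist", "honestly, i adored the heist plan; honestly superb."),
    ("bob", "alien", "The alien ship was dull. Dull dull dull."),
    ("bob", "heist", "The heist plan was dull. Dull dull dull."),
]
texts = [text for _, _, text in reviews]

# The same documents are labelled once by author and once by movie.
author_model = NaiveBayes().fit(texts, [author for author, _, _ in reviews])
topic_model = NaiveBayes().fit(texts, [movie for _, movie, _ in reviews])

print(author_model.predict("honestly superb"))
print(topic_model.predict("alien ship"))
```

Training two models on identical text with different label sets mirrors the two-axis design described above: only the labels change, so any difference in accuracy reflects how well the features capture authorship versus topic.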
