A Comparative Study of Language Models for Book and Author Recognition

Linguistic information can improve the evaluation of similarity between documents; however, which kind of linguistic information to use depends on the task. In this paper, we show that distributions of syntactic structures capture the way works are written and identify individual books correctly more than 76% of the time. In comparison, baseline features such as tf-idf-weighted keywords and function words give an accuracy of at most 66%. However, testing the same features on authorship attribution shows that distributions of syntactic structures are less successful than function words on this task. Syntactic structures vary even among the works of the same author, whereas features such as function words are distributed more similarly across an author's works and capture authorship more effectively.
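
As a rough illustration of the function-word baseline mentioned above, the following minimal sketch builds a relative-frequency profile over a small function-word list and labels a query text by its most similar profile under cosine similarity. The word list, the function names (`function_word_profile`, `cosine`, `nearest_label`), and the nearest-profile rule are assumptions made for illustration; the abstract does not specify the feature list or the classifier used in the paper.

```python
import math
import re
from collections import Counter

# Hypothetical, tiny function-word list chosen for illustration only.
FUNCTION_WORDS = ["the", "of", "and", "to", "a", "in", "that", "it", "but", "with"]


def function_word_profile(text):
    """Return the relative frequency of each word in FUNCTION_WORDS within text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero on empty text
    return [counts[w] / total for w in FUNCTION_WORDS]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_label(query_text, labeled_texts):
    """Return the label in labeled_texts (label -> text) whose profile is closest to the query."""
    query = function_word_profile(query_text)
    return max(
        labeled_texts,
        key=lambda label: cosine(query, function_word_profile(labeled_texts[label])),
    )
```

In this sketch, `labeled_texts` could map author names to known writing samples for authorship attribution, or book titles to excerpts for book recognition; only the labels change, not the features.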
