A Classifier System for Author Recognition Using Synonym-Based Features

The writing style of an author is a phenomenon that computer scientists and stylometrists have modeled in the past with some success. However, because writing styles are complex and variable, simple models often break down when faced with real-world data. Current stylometric classifier systems therefore often employ hundreds of features. In this paper, we present a novel set of synonym-based features for author recognition. We outline a basic model of how synonyms relate to an author's identity and then build two additional models refined to meet real-world needs. Experiments on four authors show a strong correlation between the presented metric and writing style, with the second of the three models outperforming the others. As modern stylometric classifier systems demand increasingly large feature sets, this new set of synonym-based features helps meet that need.
