Authorship Attribution Using Lexical Attraction

Authorship attribution determines who wrote a text when its authorship is unclear, for example when two or more people claim to have written it or when no one is willing (or able) to claim it. To extend the tools available for authorship attribution, I introduce lexical attraction as a way to distinguish authors. I implemented a program called StyleChooser, based on Yuret's lexical attraction parser, that determines the author of a text. Once trained on a set of authors, StyleChooser determines how much information in a test text is redundant under each author model. Dividing that amount by the number of words in the test text and by the log of the number of words used to train the model gives a metric used to rank the known authors by the likelihood that they wrote the text in question. I then tested StyleChooser and analyzed the results. When tested with knowledge of 62 authors on 369 texts by those authors, my program identified the correct author 75% of the time, and the correct author ranked among the top three 86% of the time. The closeness of a few authors shows that StyleChooser does a better job of differentiating between styles in a broader sense than between individual authors. A program that differentiates between styles could be used for style differentiation, style-based searching, and even better human/computer interaction.

Thesis Supervisor: Patrick Winston
Title: Ford Professor of Artificial Intelligence and Computer Science
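
The ranking metric described in the abstract can be sketched roughly as follows. This is a minimal illustration, not StyleChooser's actual code: the function names, the natural-log base, the assumption that a higher score means a more likely author, and all of the numbers are hypothetical. The redundant-information values are taken as given inputs here; in the thesis they come from the lexical attraction model trained for each author.

```python
import math


def normalized_score(redundant_bits, n_test_words, n_train_words):
    """Redundant information divided by test length and by log of training size.

    Assumption: natural log; the abstract does not state the base.
    """
    if n_test_words <= 0 or n_train_words <= 1:
        raise ValueError("need a non-empty test text and more than one training word")
    return redundant_bits / (n_test_words * math.log(n_train_words))


def rank_authors(redundancy_by_author, n_test_words, train_sizes):
    """Return author names ordered from most to least likely.

    Assumption: more redundancy under an author's model means a better fit.
    """
    scores = {
        author: normalized_score(bits, n_test_words, train_sizes[author])
        for author, bits in redundancy_by_author.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical inputs: redundancy measured for one 2000-word test text
# under three author models trained on different amounts of text.
redundancy = {"author_a": 5200.0, "author_b": 6100.0, "author_c": 4800.0}
train_sizes = {"author_a": 100_000, "author_b": 1_000_000, "author_c": 50_000}

ranking = rank_authors(redundancy, 2000, train_sizes)
print(ranking)
```

With these made-up numbers, author_b has the highest raw redundancy but falls to last place once its much larger training set is accounted for, which is the kind of bias the log normalization appears intended to offset.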

[1]  Robert Bosch, et al.  Separating Hyperplanes and the Authorship of the Disputed Federalist Papers , 1998 .

[2]  藤田 佳子 Henry David Thoreau's A Week on the Concord and Merrimack Rivers : Deviation from , 2004 .

[3]  D. W. Foster Author Unknown: On the Trail of Anonymous , 2000 .

[4]  Daniel Defoe,et al.  From London to Land's End , 2002 .

[5]  Joseph Rudman,et al.  The State of Authorship Attribution Studies: Some Problems and Solutions , 1997, Comput. Humanit..

[6]  G. Macdonald At the Back of the North Wind , 1871 .

[7]  J. London Love Of Life And Other Stories , 2022 .

[8]  C. E. Shannon A mathematical theory of communication , 1948, Bell Syst. Tech. J..

[9]  Jörg Kindermann,et al.  Authorship Attribution with Support Vector Machines , 2003, Applied Intelligence.

[10]  Michal Ephratt Authorship attribution - the case of lexical innovations , 1997 .

[11]  David I. Holmes,et al.  Neural network applications in stylometry: The Federalist Papers , 1996, Comput. Humanit..

[12]  Victor Appleton,et al.  Tom Swift and His Air Glider , 2022 .

[13]  Ian H. Witten,et al.  Lexical attraction for text compression , 1999, Proceedings DCC'99 Data Compression Conference (Cat. No. PR00096).

[14]  A. Q. Morton,et al.  Analysing for authorship : a guide to the cusum technique , 1996 .

[15]  D. Holmes,et al.  The Federalist Revisited: New Directions in Authorship Attribution , 1995 .

[16]  Jussi Karlgren,et al.  Recognizing Text Genres With Simple Metrics Using Discriminant Analysis , 1994, COLING.

[17]  F. Mosteller,et al.  Inference and Disputed Authorship: The Federalist , 1966 .

[18]  Deniz Yuret,et al.  Discovery of linguistic relations using lexical attraction , 1998, ArXiv.

[19]  David D. Lewis,et al.  Feature Selection and Feature Extraction for Text Categorization , 1992, HLT.

[20]  Colin Martindale, et al.  On the utility of content analysis in author attribution: The Federalist , 1995, Comput. Humanit..

[21]  D. Holmes The Evolution of Stylometry in Humanities Scholarship , 1998 .