Authorship Attribution Through Function Word Adjacency Networks

A method for authorship attribution based on function word adjacency networks (WANs) is introduced. Function words are words that express grammatical relationships between other words but do not carry lexical meaning on their own. In the WANs in this paper, nodes are function words, and a directed edge from a source function word to a target function word represents the likelihood of finding the latter in the ordered vicinity of the former. WANs of different authors can be interpreted as transition probabilities of Markov chains and are therefore compared in terms of their relative entropies. Optimal selection of WAN parameters is studied, and attribution accuracy is benchmarked across a diverse pool of authors and varying text lengths. This analysis shows that, since function words are independent of content, their use tends to be specific to an author, and that the relational data captured by function WANs is a good summary of stylometric fingerprints. Attribution accuracy is observed to exceed that achieved by methods that rely on word frequencies alone. Combining WANs with methods that rely on word frequencies increases attribution accuracy further, indicating that the two sources of information encode different aspects of authorial style.
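The pipeline described above (build a WAN per text, read it as a Markov chain, compare chains by relative entropy) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function-word list, the window size, the decay factor, the smoothing constant, the direction of the relative entropy, and all function names are assumptions introduced here.

```python
import math

# Hypothetical, deliberately tiny function-word list (assumption for illustration).
FUNCTION_WORDS = ("the", "of", "and", "a", "to", "in", "that", "with")


def build_wan(text, window=5, decay=0.8, smoothing=1e-3):
    """Return a row-normalized WAN: wan[source][target] is a transition probability.

    window, decay and smoothing are assumed parameters; the smoothing keeps
    every entry positive so that logarithms and the stationary distribution exist.
    """
    tokens = text.lower().split()
    counts = {s: {t: smoothing for t in FUNCTION_WORDS} for s in FUNCTION_WORDS}
    for i, source in enumerate(tokens):
        if source not in counts:
            continue
        # Look only at words that FOLLOW the source (ordered vicinity).
        for d in range(1, window + 1):
            if i + d >= len(tokens):
                break
            target = tokens[i + d]
            if target in counts:
                counts[source][target] += decay ** (d - 1)
    wan = {}
    for source, row in counts.items():
        total = sum(row.values())
        wan[source] = {t: w / total for t, w in row.items()}
    return wan


def stationary(wan, iters=200):
    """Approximate the stationary distribution of the chain by power iteration."""
    words = list(wan)
    pi = {w: 1.0 / len(words) for w in words}
    for _ in range(iters):
        pi = {t: sum(pi[s] * wan[s][t] for s in words) for t in words}
    return pi


def relative_entropy(p, q):
    """Relative entropy rate of chain p with respect to chain q."""
    pi = stationary(p)
    return sum(
        pi[s] * p[s][t] * math.log(p[s][t] / q[s][t])
        for s in p
        for t in p[s]
    )


def attribute(unknown_text, profiles):
    """Pick the candidate whose profile WAN is closest to the unknown text's WAN."""
    unknown = build_wan(unknown_text)
    return min(profiles, key=lambda name: relative_entropy(unknown, profiles[name]))


if __name__ == "__main__":
    profiles = {
        "A": build_wan("the cat of the house and the dog of the yard and the bird of the wood"),
        "B": build_wan("to go in a car with a friend to run in a park with a dog"),
    }
    print(attribute("the bird of the tree and the fish of the sea", profiles))
```

Weighting each row's divergence by the stationary distribution means that function words an author uses often contribute more to the comparison than rarely visited ones; the toy texts here are far too short for meaningful attribution and only exercise the code path.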
