Text clustering with styles

This thesis mainly addresses the author clustering problem: given a set of n texts, the goal is to determine the number k of distinct authors and to group the texts into k classes according to their author. We iteratively build a stable and simple model for text clustering with styles. We start by designing a measure that reflects the (un)certainty of the proposed decision, so that every decision comes with a confidence of correctness instead of only a single answer. Afterwards, we link those pairs of texts for which we see an indication of shared authorship and have enough evidence that the same person has written them. Finally, after checking for every text tuple whether its members can be linked, we build the final clusters with a strategy based on a distance between probability distributions. Employing a dynamic threshold, we can choose the smallest relative distance values to detect a common origin of the texts. While our study mostly focuses on the creation of simple methods, investigating more complex schemes leads to interesting findings. We evaluate distributed language representations and compare them to several state-of-the-art methods for authorship attribution. This comparison allows us to demonstrate that not every approach excels in every situation and that deep learning methods might be sensitive to parameter settings. The observations most similar to the sample in question (or the category with the smallest distance) usually determine the proposed answers. We test multiple inter-textual distance functions in theoretical and empirical tests and show that the Tanimoto and Matusita distances respect all theoretical properties. Both of them perform well in empirical tests, but the Canberra and Clark measures are even better suited, even though they do not fulfill all the requirements. Overall, we note that the popular Cosine function neither satisfies all the conditions nor works notably well.
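To make the comparison of inter-textual distance functions more concrete, the following minimal sketch implements the five measures named above for two texts represented as relative-frequency vectors over the same features. The formulas are the common textbook definitions and are an assumption here; the thesis may use variants that differ in detail. All function names are illustrative.

```python
import math


def tanimoto(a, b):
    """Sum of absolute differences divided by the sum of element-wise maxima."""
    num = sum(abs(x - y) for x, y in zip(a, b))
    den = sum(max(x, y) for x, y in zip(a, b))
    return num / den if den else 0.0


def matusita(a, b):
    """Euclidean distance between the square roots of the two vectors."""
    return math.sqrt(sum((math.sqrt(x) - math.sqrt(y)) ** 2 for x, y in zip(a, b)))


def canberra(a, b):
    """Sum of per-feature relative differences; features absent in both are skipped."""
    return sum(abs(x - y) / (x + y) for x, y in zip(a, b) if x + y > 0)


def clark(a, b):
    """Square root of the sum of squared per-feature relative differences."""
    return math.sqrt(
        sum((abs(x - y) / (x + y)) ** 2 for x, y in zip(a, b) if x + y > 0)
    )


def cosine_distance(a, b):
    """One minus the cosine of the angle between the two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


if __name__ == "__main__":
    # Hypothetical relative frequencies of three features in two texts.
    text_a = [0.5, 0.3, 0.2]
    text_b = [0.2, 0.3, 0.5]
    for measure in (tanimoto, matusita, canberra, clark, cosine_distance):
        print(measure.__name__, round(measure(text_a, text_b), 4))
```

Canberra and Clark normalize each feature by its own magnitude, so rare features weigh as much as frequent ones, whereas Tanimoto, Matusita, and Cosine are dominated by the frequent features.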
Furthermore, we see that reducing the text representation not only decreases the runtime but can also increase the performance by ignoring spurious features. Our model can choose the characteristics that are the most relevant to the text in question and can analyze the author adequately. We apply our systems to various natural languages from a variety of language families and to multiple text genres. With the flexible feature selection, our systems achieve reliable results in all of the tested settings.
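The linking and clustering steps described in the abstract can be illustrated with a small sketch. It takes one (possibly reduced) relative-frequency profile per text, links every pair whose distance is unusually small relative to all pairwise distances, and reports the connected groups as clusters, so the number of authors k follows from the links. The threshold rule (mean minus a multiple of the standard deviation), the `strictness` parameter, the Manhattan distance, and all names are assumptions made for illustration, not the procedure of the thesis.

```python
from itertools import combinations
from statistics import mean, pstdev


def manhattan(a, b):
    """Illustrative distance between two relative-frequency profiles."""
    return sum(abs(x - y) for x, y in zip(a, b))


def cluster_by_links(profiles, distance=manhattan, strictness=1.0):
    """Group text indices whose pairwise distance falls below a dynamic threshold."""
    n = len(profiles)
    if n < 2:
        return [[i] for i in range(n)]

    # Distance for every unordered pair of texts.
    pairs = {
        (i, j): distance(profiles[i], profiles[j])
        for i, j in combinations(range(n), 2)
    }

    # Dynamic threshold (assumed rule): a pair is linked only if its distance
    # is small relative to the distribution of all pairwise distances.
    values = list(pairs.values())
    threshold = mean(values) - strictness * pstdev(values)

    # Union-find: linked texts end up in the same connected component.
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for (i, j), d in pairs.items():
        if d < threshold:
            parent[find(i)] = find(j)

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


if __name__ == "__main__":
    # Hypothetical profiles: texts 0 and 1 are stylistically very close.
    profiles = [
        [0.60, 0.30, 0.10],
        [0.59, 0.31, 0.10],
        [0.10, 0.30, 0.60],
        [0.30, 0.40, 0.30],
    ]
    clusters = cluster_by_links(profiles)
    print(len(clusters), "clusters:", clusters)
```

Because the threshold is derived from the collection itself, the same code adapts to corpora whose distances live on different scales; a larger `strictness` value links fewer pairs and therefore yields more, smaller clusters.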

[97]  Fuchun Peng,et al.  N-GRAM-BASED AUTHOR PROFILES FOR AUTHORSHIP ATTRIBUTION , 2003 .