Masking topic‐related information to enhance authorship attribution

Authorship attribution attempts to reveal the authors of documents. In recent years, research in this field has grown rapidly. However, the performance of state‐of‐the‐art methods suffers heavily when texts of known authorship and texts under investigation differ in topic and/or genre. So far, it is not clear how to quantify the personal style of authors in a way that is not affected by topic shifts or genre variations. In this paper, a set of text distortion methods is used to mask topic‐related information. These methods transform the input texts into a more topic‐neutral form while maintaining the structure of documents associated with the personal style of the author. Using a controlled corpus that includes a fine‐grained range of topics and genres, we demonstrate how the proposed approach can be combined with existing authorship attribution methods to enhance their performance in very challenging tasks, especially in cross‐topic attribution. We also examine cross‐genre attribution and the most challenging, yet realistic, cross‐topic‐and‐genre attribution scenarios, and show how the proposed techniques should be tuned to enhance performance in these tasks. Finally, we demonstrate that there are important differences in attribution effectiveness when either conversational genres, nonconversational genres, or a mix of them are considered.
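
To make the idea of masking topic‐related information concrete, the following is a minimal sketch of one possible distortion, not necessarily the exact method evaluated in the paper. It assumes that words outside a small list of frequent words carry most of the topical content, and it replaces each letter of such words with an asterisk so that word lengths, punctuation, and frequent words are preserved. The function name `mask_topic_words` and the tiny `FREQUENT_WORDS` list are hypothetical and chosen only for illustration.

```python
import re

# Hypothetical frequent-word list (an assumption for this sketch);
# a real list would be derived from a large reference corpus.
FREQUENT_WORDS = {"the", "of", "and", "a", "in", "to", "is", "was", "it", "that"}


def mask_topic_words(text, frequent_words=FREQUENT_WORDS):
    """Replace every letter of each word not in frequent_words with '*'.

    Frequent words, digits, punctuation, and whitespace are left untouched,
    so the overall structure of the text is kept.
    """
    def mask(match):
        word = match.group(0)
        if word.lower() in frequent_words:
            return word  # keep frequent (mostly function) words as they are
        return "*" * len(word)  # hide the word but keep its length

    # Only runs of ASCII letters are treated as words in this sketch.
    return re.sub(r"[A-Za-z]+", mask, text)


print(mask_topic_words("The price of oil was rising in 2008."))
# prints: The ***** of *** was ****** in 2008.
```

The distorted text can then be passed to an existing attribution method (for example, one based on character n‐grams) in place of the original text. The size of the frequent-word list is the natural parameter to tune in such a setup.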
