The Topic Confusion Task: A Novel Scenario for Authorship Attribution

Authorship attribution is the problem of identifying the most plausible author of an anonymous text from a set of candidate authors. Researchers have investigated same-topic and cross-topic scenarios of authorship attribution, which differ according to whether new, unseen topics are used in the testing phase. However, neither scenario allows us to tell whether an error is caused by a failure to capture an author's writing style or by the topic shift. Motivated by this, we propose the topic confusion task, in which we switch the author-topic configuration between the training and testing sets. This setup allows us to distinguish two types of errors: those caused by the topic shift and those caused by the features’ inability to capture writing style. We show that stylometric features with part-of-speech tags are the least susceptible to topic variations. We further show that combining them with other features leads to significantly lower topic confusion and higher attribution accuracy. Finally, we show that pretrained language models such as BERT and RoBERTa perform poorly on this task and are surpassed by simple features such as word-level n-grams.
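
The abstract does not spell out the exact split protocol, so the following Python sketch is only one possible reading of "switching the author-topic configuration". It assumes that authors and topics are each divided into two groups, that training pairs each author group with one topic group, and that testing uses the swapped pairing. The function names (`topic_confusion_split`, `error_breakdown`), the record layout, and the toy data are all hypothetical and are not taken from the paper.

```python
from collections import Counter


def topic_confusion_split(docs, group_a_authors, group_a_topics):
    """Split (author, topic, text) records into train and test sets.

    Assumed protocol: training keeps group-A authors on group-A topics and
    the remaining authors on the remaining topics; testing gets the swapped
    author-topic pairs.
    """
    train, test = [], []
    for author, topic, text in docs:
        author_in_a = author in group_a_authors
        topic_in_a = topic in group_a_topics
        if author_in_a == topic_in_a:
            train.append((author, topic, text))
        else:
            test.append((author, topic, text))
    return train, test


def error_breakdown(test, predictions, group_a_authors):
    """Count correct predictions and the two assumed error types."""
    counts = Counter()
    for (author, _topic, _text), predicted in zip(test, predictions):
        if predicted == author:
            counts["correct"] += 1
        elif (predicted in group_a_authors) != (author in group_a_authors):
            # The predicted author belongs to the group that wrote on this
            # topic during training: attributed here to the topic shift.
            counts["topic_confusion"] += 1
        else:
            # Wrong author from the true author's own group: attributed here
            # to the features failing to capture writing style.
            counts["style_error"] += 1
    return counts


# Toy data: four hypothetical authors, two hypothetical topics.
docs = [
    (author, topic, f"{author} writing about {topic}")
    for author in ("ann", "bob", "cat", "dan")
    for topic in ("politics", "sports")
]
group_a_authors = {"ann", "bob"}
group_a_topics = {"politics"}

train, test = topic_confusion_split(docs, group_a_authors, group_a_topics)

# Hypothetical classifier output, aligned with the order of `test`.
predictions = ["ann", "cat", "dan", "ann"]
breakdown = error_breakdown(test, predictions, group_a_authors)
```

Under this reading, an error counts as topic confusion when the predicted author comes from the group that covered the test topic during training. Any other error is treated as a failure to capture style. The classifier itself is left out on purpose, since the sketch only illustrates the split and the error accounting.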
