Stylometric Authorship Attribution of Collaborative Documents

Stylometry is the study of writing style based on linguistic features and is typically applied to authorship attribution problems. In this work, we apply stylometry to a novel dataset of multi-authored documents collected from Wikia, using both relaxed classification with a support vector machine (SVM) and multi-label classification techniques. We define five possible scenarios and show that one, the case where labeled and unlabeled collaborative documents by the same authors are available, yields high accuracy on our dataset, while the other, more restrictive cases yield lower accuracies. Based on the results of these experiments and knowledge of the multi-label classifiers used, we propose a hypothesis to explain the overall poor performance in the more restrictive cases. Additionally, we perform authorship attribution of pre-segmented text from the Wikia dataset and show that, while this performs better than multi-label learning, it requires large amounts of data to be successful.
