An Ensemble Approach to Cross-Domain Authorship Attribution

This paper presents an ensemble approach to cross-domain authorship attribution that combines the predictions of three independent classifiers, based respectively on standard character n-grams, character n-grams with non-diacritic distortion, and word n-grams. Our proposal relies on variable-length n-gram models and multinomial logistic regression, and selects the prediction of highest probability among the three models as the output for the task. We compare the approach against a number of baseline systems and report results on both the PAN-CLEF 2018 test data and a new corpus of song lyrics in English and Portuguese.
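
The selection step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`char_ngrams`, `distort`, `select_prediction`), the n-gram length range, and the exact masking rule are assumptions made for illustration. In particular, "non-diacritic distortion" is read here as masking every letter or digit that carries no diacritic with `*`, which may differ from the rule used in the paper. The per-model probabilities are taken as given (for example, from a multinomial logistic regression trained on each feature set).

```python
from collections import Counter
import unicodedata


def char_ngrams(text, n_min=2, n_max=4):
    """Count character n-grams of every length from n_min to n_max (assumed range)."""
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


def distort(text):
    """Mask letters and digits that carry no diacritic with '*'.

    Assumption: accented letters, punctuation and whitespace are kept as-is.
    """
    out = []
    for ch in text:
        # A character has a diacritic if canonical decomposition changes it.
        has_diacritic = unicodedata.normalize("NFD", ch) != ch
        out.append("*" if ch.isalnum() and not has_diacritic else ch)
    return "".join(out)


def select_prediction(model_probs):
    """Return (author, model, probability) for the most confident model.

    model_probs maps a model name to a dict of author -> probability.
    """
    best_author, best_model, best_p = None, None, -1.0
    for model, probs in model_probs.items():
        author, p = max(probs.items(), key=lambda kv: kv[1])
        if p > best_p:
            best_author, best_model, best_p = author, model, p
    return best_author, best_model, best_p


# Hypothetical probabilities for one unknown document and two candidate authors.
example = {
    "char": {"A": 0.60, "B": 0.40},
    "char_distorted": {"A": 0.55, "B": 0.45},
    "word": {"A": 0.30, "B": 0.70},
}
print(select_prediction(example))
print(distort("Não sei!"))
```

In this example the word n-gram model is the most confident, so its prediction is the one returned even though the other two models favour a different author. This is what distinguishes highest-probability selection from majority voting.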
