Document embeddings learned on various types of n-grams for cross-topic authorship attribution

Document embedding methods have recently been proposed to capture hidden properties of texts. These methods represent documents as fixed-length, continuous, dense feature vectors. In this paper, we propose learning document vectors from n-grams rather than from words alone, using the recently proposed Paragraph Vector method. The n-grams include character n-grams, word n-grams, and n-grams of POS tags (in all cases with n varying from 1 to 5). We consider the task of cross-topic authorship attribution and run experiments on The Guardian corpus. Experimental results show that our method outperforms word-based embeddings and character n-gram based linear models, which are among the most effective approaches for identifying the writing style of an author.
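As an illustration only, the three n-gram types mentioned above could be extracted roughly as follows. This is a minimal sketch, not the implementation used in the paper: the function names, the `_` separator, the whitespace tokenization, and the hand-written POS tags are all assumptions made for the example. Each resulting n-gram sequence would then play the role of the "words" of a document when training a Paragraph Vector model.

```python
from typing import Dict, List, Sequence


def char_ngrams(text: str, n: int) -> List[str]:
    """Return all overlapping character n-grams of length n."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def token_ngrams(tokens: Sequence[str], n: int, sep: str = "_") -> List[str]:
    """Return all overlapping n-grams over a token sequence (words or POS tags).

    The separator is an arbitrary choice for this sketch.
    """
    return [sep.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngrams_for_all_orders(tokens: Sequence[str], max_n: int = 5) -> Dict[int, List[str]]:
    """Build token n-grams for every n from 1 to max_n (1 to 5 in the abstract)."""
    return {n: token_ngrams(tokens, n) for n in range(1, max_n + 1)}


# Toy document. The whitespace split and the POS tags below are
# placeholders; a real pipeline would use a tokenizer and a tagger.
text = "the cat sat"
words = text.split()
pos_tags = ["DT", "NN", "VBD"]  # hypothetical tagger output

char_trigrams = char_ngrams("the cat", 3)
word_bigrams = token_ngrams(words, 2)
pos_bigrams = token_ngrams(pos_tags, 2)
word_ngrams_by_n = ngrams_for_all_orders(words)
```

For orders larger than the sequence length the extractors simply return an empty list, so short documents do not need special handling.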
