Document embeddings learned on various types of n-grams for cross-topic authorship attribution

Document embedding methods have recently been proposed to capture hidden properties of texts. These methods represent documents as fixed-length, continuous, dense feature vectors. In this paper, we propose to learn document vectors from n-grams rather than from words alone, using the recently proposed Paragraph Vector method. The n-grams include character n-grams, word n-grams, and n-grams of POS tags, in all cases with n varying from 1 to 5. We consider the task of Cross-Topic Authorship Attribution and run experiments on The Guardian corpus. Experimental results show that our method outperforms word-based embeddings and character n-gram based linear models, which are among the most effective approaches for identifying the writing style of an author.
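
To make the three n-gram types in the abstract concrete, the following minimal sketch extracts character, word, and POS-tag n-grams with n from 1 to 5. It is an illustration under stated assumptions, not the authors' pipeline: the function names, the whitespace tokenization, and the underscore used to join multi-token n-grams are all hypothetical choices, and the POS tags are assumed to come from an external tagger.

```python
from typing import List, Sequence


def ngrams(tokens: Sequence[str], n_min: int = 1, n_max: int = 5,
           sep: str = "_") -> List[str]:
    """Return all contiguous n-grams of `tokens` for n in [n_min, n_max].

    Each n-gram is joined into a single string so that it can serve as
    one "token" for an embedding model (an assumption of this sketch).
    """
    result = []
    for n in range(n_min, n_max + 1):
        for start in range(len(tokens) - n + 1):
            result.append(sep.join(tokens[start:start + n]))
    return result


def char_ngrams(text: str, n_min: int = 1, n_max: int = 5) -> List[str]:
    # Characters are joined without a separator, so "ab" stays "ab".
    return ngrams(list(text), n_min, n_max, sep="")


def word_ngrams(text: str, n_min: int = 1, n_max: int = 5) -> List[str]:
    # Whitespace tokenization is a simplification made for this sketch.
    return ngrams(text.split(), n_min, n_max, sep="_")


def pos_ngrams(tags: Sequence[str], n_min: int = 1, n_max: int = 5) -> List[str]:
    # `tags` is assumed to be produced by an external POS tagger.
    return ngrams(tags, n_min, n_max, sep="_")


if __name__ == "__main__":
    print(char_ngrams("abc", 1, 3))
    print(word_ngrams("the cat sat", 1, 2))
    print(pos_ngrams(["DT", "NN", "VBD"], 2, 2))
```

Each document then becomes a sequence of n-gram tokens, which could be passed to a Paragraph Vector implementation (for example, gensim's `Doc2Vec`) in place of plain words. How the n-gram types are combined is not stated in the abstract, so that step is left out here.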

Helena Gómez-Adorno | Grigori Sidorov | David Pinto | Juan Pablo Posadas-Durán
