Character-level and Multi-channel Convolutional Neural Networks for Large-scale Authorship Attribution

Convolutional neural networks (CNNs) have demonstrated superior capability for extracting information from raw signals in computer vision. Recently, character-level and multi-channel CNNs have exhibited excellent performance for sentence classification tasks. We apply CNNs to large-scale authorship attribution, which aims to determine an unknown text's author among many candidate authors, motivated by their ability to process character-level signals and to differentiate between a large number of classes, while making fast predictions in comparison to state-of-the-art approaches. We extensively evaluate CNN-based approaches that leverage word and character channels and compare them against state-of-the-art methods for a large range of author numbers, shedding new light on traditional approaches. We show that character-level CNNs outperform the state-of-the-art on four out of five datasets in different domains. Additionally, we present the first application of authorship attribution to reddit.

[1]  Geoffrey E. Hinton,et al.  ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[2]  Shlomo Argamon,et al.  Authorship attribution in the wild , 2010, Lang. Resour. Evaluation.

[3]  Jason Weston,et al.  Natural Language Processing (Almost) from Scratch , 2011, J. Mach. Learn. Res..

[4]  Louise Guthrie,et al.  Authorship Attribution of E-Mail: Comparing Classifiers over a New Corpus for Evaluation , 2008, LREC.

[5]  Ruslan Salakhutdinov,et al.  A Multiplicative Model for Learning Distributed Text-Based Attribute Representations , 2014, NIPS.

[6]  George M. Mohay,et al.  Mining e-mail content for author identification forensics , 2001, SGMD.

[7]  Roy Schwartz,et al.  Authorship Attribution of Micro-Messages , 2013, EMNLP.

[8]  Shlomo Argamon,et al.  Effects of Age and Gender on Blogging , 2006, AAAI Spring Symposium: Computational Approaches to Analyzing Weblogs.

[9]  Stefanos Gritzalis,et al.  Identifying Authorship by Byte-Level N-Grams: The Source Code Author Profile (SCAP) Method , 2007, Int. J. Digit. EVid..

[10]  Paul A. Watters,et al.  Unsupervised authorship analysis of phishing webpages , 2012, 2012 International Symposium on Communications and Information Technologies (ISCIT).

[11]  Byron C. Wallace,et al.  Sparse, Contextually Informed Models for Irony Detection: Exploiting User Communities, Entities and Sentiment , 2015, ACL.

[12]  Jeffrey Dean,et al.  Distributed Representations of Words and Phrases and their Compositionality , 2013, NIPS.

[13]  Xiang Zhang,et al.  Character-level Convolutional Networks for Text Classification , 2015, NIPS.

[14]  Geoffrey E. Hinton,et al.  Rectified Linear Units Improve Restricted Boltzmann Machines , 2010, ICML.

[15]  Cícero Nogueira dos Santos,et al.  Learning Character-level Representations for Part-of-Speech Tagging , 2014, ICML.

[16]  Matthew D. Zeiler ADADELTA: An Adaptive Learning Rate Method , 2012, ArXiv.

[17]  Richard Dazeley,et al.  Authorship Attribution for Twitter in 140 Characters or Less , 2010, 2010 Second Cybercrime and Trustworthy Computing Workshop.

[18]  Matthias Hagen,et al.  Who Wrote the Web? Revisiting Influential Author Identification Research Applicable to Information Retrieval , 2016, ECIR.

[19]  Hsinchun Chen,et al.  Applying authorship analysis to extremist-group Web forum messages , 2005, IEEE Intelligent Systems.

[20]  Efstathios Stamatatos,et al.  A survey of modern authorship attribution methods , 2009, J. Assoc. Inf. Sci. Technol..

[21]  Jeffrey Pennington,et al.  GloVe: Global Vectors for Word Representation , 2014, EMNLP.

[22]  Sanja Fidler,et al.  Aligning Books and Movies: Towards Story-Like Visual Explanations by Watching Movies and Reading Books , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[23]  Alexander M. Rush,et al.  Character-Aware Neural Language Models , 2015, AAAI.

[24]  Ingrid Zukerman,et al.  Authorship Attribution with Latent Dirichlet Allocation , 2011, CoNLL.

[25]  Cícero Nogueira dos Santos,et al.  Boosting Named Entity Recognition with Neural Character Embeddings , 2015, NEWS@ACL.

[26]  Ron J. Weiss,et al.  Speech acoustic modeling from raw multichannel waveforms , 2015, 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[27]  Gerald Penn,et al.  Applying Convolutional Neural Networks concepts to hybrid NN-HMM model for speech recognition , 2012, 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[28]  Yoon Kim,et al.  Convolutional Neural Networks for Sentence Classification , 2014, EMNLP.

[29]  Efstathios Stamatatos,et al.  Author Identification Using Imbalanced and Limited Training Texts , 2007, 18th International Workshop on Database and Expert Systems Applications (DEXA 2007).

[30]  Steven Bethard,et al.  Not All Character N-grams Are Created Equal: A Study in Authorship Attribution , 2015, NAACL.