Learning Universal Authorship Representations

Determining whether two documents were written by the same author, a task known as authorship verification, has traditionally been tackled using statistical methods. Recently, authorship representations learned using neural networks have been found to outperform alternatives, particularly in large-scale settings involving hundreds of thousands of authors. But do representations learned in one domain transfer to other domains, or are they inherently entangled with domain-specific features? To address these questions, we conduct the first large-scale study of cross-domain transfer for authorship verification, considering zero-shot transfer among three disparate domains: Amazon reviews, fanfiction short stories, and Reddit comments. We find that although a surprising degree of transfer is possible between certain domains, it is less successful between others. We examine the properties of these domains that influence generalization and propose simple but effective methods to improve transfer.
