Adversarial Stylometry in the Wild: Transferable Lexical Substitution Attacks on Author Profiling

Written language contains stylistic cues that can be exploited to automatically infer a variety of potentially sensitive author attributes. Adversarial stylometry aims to defeat such models by rewriting an author's text. We propose several components that facilitate deploying these adversarial attacks in the wild, where neither training data nor target models are accessible. We introduce a transformer-based extension of a lexical replacement attack and show that, when trained on a weakly labeled corpus, it achieves high transferability, decreasing target model performance to below chance. While not entirely inconspicuous, our more successful attacks also prove markedly less detectable by human judges. Our framework therefore offers a promising direction for future privacy-preserving adversarial attacks.
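The core idea of a transferable lexical replacement attack can be illustrated in miniature. In the sketch below, everything is a stand-in assumption rather than the paper's actual method: `CANDIDATES` plays the role of a masked language model proposing context-fitting substitutes for a word, and `profiler` plays the role of a surrogate profiling classifier (here a toy marker-word lexicon) whose confidence the attack greedily minimizes, word by word:

```python
# Hypothetical stand-in for a masked LM's substitute candidates per word.
CANDIDATES = {
    "lovely": ["nice", "fine"],
    "adore": ["like", "enjoy"],
}

# Hypothetical stand-in for a surrogate profiling model: toy marker weights.
LEXICON = {"lovely": 0.9, "adore": 0.8}

def profiler(tokens):
    """Toy surrogate score: sum of marker weights (higher = more confident)."""
    return sum(LEXICON.get(t, 0.0) for t in tokens)

def attack(tokens):
    """Greedily replace each word with the candidate that most lowers the
    surrogate profiler's score; keep the original word if nothing helps."""
    out = list(tokens)
    for i, word in enumerate(tokens):
        best, best_score = word, profiler(out)
        for cand in CANDIDATES.get(word, []):
            trial = out[:i] + [cand] + out[i + 1:]
            score = profiler(trial)
            if score < best_score:
                best, best_score = cand, score
        out[i] = best
    return out

original = ["i", "adore", "this", "lovely", "cafe"]
print(attack(original))  # marker words swapped for lower-scoring substitutes
```

In a realistic attack, candidate generation would come from a pretrained masked language model and the substitution would additionally be constrained by semantic similarity and fluency filters; transferability then hinges on the surrogate profiler's decision boundary resembling the inaccessible target model's.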
