Adversarial Semantic Collisions

We study semantic collisions: texts that are semantically unrelated but judged as similar by NLP models. We develop gradient-based approaches for generating semantic collisions and demonstrate that state-of-the-art models for many tasks that rely on analyzing the meaning and similarity of texts (including paraphrase identification, document retrieval, response suggestion, and extractive summarization) are vulnerable to semantic collisions. For example, given a target query, inserting a crafted collision into an irrelevant document can shift its retrieval rank from 1000 to the top 3. We show how to generate semantic collisions that evade perplexity-based filtering and discuss other potential mitigations. Our code is available at https://github.com/csong27/collision-bert.
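As a rough illustration of how a gradient-based collision search can work, the sketch below uses a HotFlip-style first-order approximation to pick a vocabulary substitution that increases a BERT pair classifier's similarity score. This is a generic illustration under our own assumptions (a vanilla bert-base-uncased model with an untrained pair-classification head standing in for a fine-tuned paraphrase or retrieval model, a single substitution position, and made-up example texts), not the paper's exact collision-generation algorithm.

```python
# Minimal sketch of a HotFlip-style, gradient-guided token substitution against
# a BERT pair classifier. All names and hyperparameters here are illustrative
# assumptions, not the paper's configuration.
import torch
from transformers import BertForSequenceClassification, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
# In practice the target would be a fine-tuned paraphrase/retrieval model; the
# untrained pair-classification head here just keeps the snippet self-contained.
model = BertForSequenceClassification.from_pretrained("bert-base-uncased")
model.eval()

target = "What is the capital of France?"        # hypothetical target text
collision = "random unrelated seed text"          # hypothetical collision seed

# Encode the (target, collision) pair as BERT expects.
enc = tokenizer(target, collision, return_tensors="pt")
emb_matrix = model.get_input_embeddings().weight  # [vocab_size, dim]
embeds = model.get_input_embeddings()(enc["input_ids"]).detach().requires_grad_(True)

# Score the pair (here: the logit of class 1, standing in for "similar")
# and backpropagate to the input embeddings.
logits = model(inputs_embeds=embeds,
               attention_mask=enc["attention_mask"],
               token_type_ids=enc["token_type_ids"]).logits
score = logits[0, 1]
score.backward()

# First-order estimate of how much substituting vocabulary token v at position
# `pos` changes the score: (e_v - e_cur)^T grad_pos. The e_cur term is constant
# per position, so ranking candidates only needs emb_matrix @ grad_pos.
grad = embeds.grad[0]                              # [seq_len, dim]
pos = int(enc["token_type_ids"][0].nonzero()[0])   # first token of the collision segment
best_token = int(torch.argmax(emb_matrix @ grad[pos]))

print("proposed substitution at position", pos, "->", tokenizer.decode([best_token]))
```

The paper's actual attacks go beyond this single greedy step, iterating gradient-guided updates over the collision text and also producing collisions that evade perplexity-based filtering.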
