Adversarial Examples for Evaluating Reading Comprehension Systems

Standard accuracy metrics indicate that reading comprehension systems are making rapid progress, but the extent to which these systems truly understand language remains unclear. To reward systems with real language understanding abilities, we propose an adversarial evaluation scheme for the Stanford Question Answering Dataset (SQuAD). Our method tests whether systems can answer questions about paragraphs that contain adversarially inserted sentences, which are automatically generated to distract computer systems without changing the correct answer or misleading humans. In this adversarial setting, the average F1 score of sixteen published models drops from 75% to 36%; when the adversary is allowed to add ungrammatical sequences of words, the average F1 score of four models drops further to 7%. We hope our insights will motivate the development of new models that understand language more precisely.
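The evaluation idea described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the toy paragraph, question, and distractor are invented, the distractor is simply appended to the end of the paragraph as one possible insertion point, `overlap_baseline` is a hypothetical stand-in for a real reading comprehension model, and `f1_score` is a simplified token-overlap F1 rather than the official SQuAD scorer.

```python
import re
from collections import Counter


def tokens(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def f1_score(prediction, gold):
    """Simplified token-overlap F1 between a predicted and a gold answer."""
    pred, ref = tokens(prediction), tokens(gold)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def add_distractor(paragraph, distractor):
    """Append a distracting sentence (assumed insertion point: the end)."""
    return paragraph.rstrip() + " " + distractor


def overlap_baseline(paragraph, question):
    """Hypothetical toy model: return the sentence sharing the most
    words with the question (ties go to the earliest sentence)."""
    q = set(tokens(question))
    sentences = re.split(r"(?<=[.!?])\s+", paragraph.strip())
    return max(sentences, key=lambda s: len(q & set(tokens(s))))


# Invented toy data: the distractor mimics the question's wording but
# concerns a different person, so the correct answer is unchanged.
paragraph = "Ada's birthplace is Lumen. She later studied mathematics in Paris."
question = "Which city was Ada born in?"
gold = "Lumen"
distractor = "The city Bea was born in is Varn."

original_f1 = f1_score(overlap_baseline(paragraph, question), gold)
adversarial_paragraph = add_distractor(paragraph, distractor)
adversarial_f1 = f1_score(overlap_baseline(adversarial_paragraph, question), gold)

print("F1 on original paragraph:   ", original_f1)
print("F1 on adversarial paragraph:", adversarial_f1)
```

In this toy setting, the word-overlap model is drawn to the appended sentence because it resembles the question on the surface, while a human reader would still answer "Lumen". Comparing the two F1 values mirrors, on a very small scale, the kind of before-and-after comparison the abstract reports.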
