Beat the AI: Investigating Adversarial Human Annotation for Reading Comprehension

Innovations in annotation methodology have been a catalyst for Reading Comprehension (RC) datasets and models. One recent trend to challenge current RC models is to involve a model in the annotation process: humans create questions adversarially, such that the model fails to answer them correctly. In this work we investigate this annotation methodology and apply it in three different settings, collecting a total of 36,000 samples with progressively stronger models in the annotation loop. This allows us to explore questions such as the reproducibility of the adversarial effect, transfer from data collected with varying model-in-the-loop strengths, and generalization to data collected without a model. We find that training on adversarially collected samples leads to strong generalization to non-adversarially collected datasets, though performance deteriorates progressively as the model in the loop becomes stronger. Furthermore, we find that stronger models can still learn from datasets collected with substantially weaker models in the loop. When trained on data collected with a BiDAF model in the loop, RoBERTa achieves 39.9 F1 on questions that it cannot answer when trained on SQuAD, only marginally lower than when trained on data collected using RoBERTa itself (41.0 F1).
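The annotation protocol described above amounts to a simple acceptance check: a human-written question enters the dataset only if the model in the loop fails to answer it. The Python sketch below illustrates one plausible version of this check; the `rc_model.predict` interface, the word-overlap F1 comparison, and the 0.4 acceptance threshold are illustrative assumptions rather than the paper's exact configuration.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """SQuAD-style word-overlap F1 between a predicted and a reference answer."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)
    common = Counter(pred_tokens) & Counter(ref_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def accept_adversarial_question(passage, question, human_answer, rc_model,
                                f1_threshold=0.4):
    """Keep a human-written question only if the model-in-the-loop fails on it.

    `rc_model` stands in for any extractive RC model (e.g. BiDAF, BERT, or
    RoBERTa) exposing a hypothetical `predict(passage, question) -> str` method;
    the 0.4 threshold is an assumed cut-off, not necessarily the one used in
    the original annotation interface.
    """
    model_answer = rc_model.predict(passage, question)
    # The question "beats" the model if its answer overlaps too little with
    # the human-provided answer; only such questions are collected.
    return token_f1(model_answer, human_answer) < f1_threshold
```

Under this reading, swapping in progressively stronger models for `rc_model` (BiDAF, then BERT, then RoBERTa) yields the three collection settings investigated in the paper.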
