Practical Annotation Strategies for Question Answering Datasets

Annotating datasets for question answering (QA) tasks is costly, as it requires intensive manual labor and often domain-specific knowledge. Yet strategies for annotating QA datasets in a cost-effective manner are scarce. To provide a remedy for practitioners, our objective is to develop heuristic rules for annotating a subset of questions, so that annotation cost is reduced while both in- and out-of-domain performance are maintained. To this end, we conduct a large-scale analysis from which we derive practical recommendations. First, we demonstrate experimentally that additional training samples often contribute only to higher in-domain test-set performance, but do not help the model generalize to unseen datasets. Second, we develop a model-guided annotation strategy: it recommends which subset of samples should be annotated. Its effectiveness is demonstrated in a case study on customizing QA to a clinical domain. Here, remarkably, annotating a stratified subset comprising only 1.2% of the original training set achieves 97.7% of the performance obtained when the complete dataset is annotated. Hence, the labeling effort can be reduced immensely. Altogether, our work addresses a practical need: when labeling budgets are limited, practitioners require recommendations for annotating QA datasets more cost-effectively.
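The abstract describes the model-guided strategy only at a high level, so the sketch below is an illustrative assumption rather than the paper's actual procedure: it assumes a source-domain QA model supplies a confidence score for each unlabeled question, that strata correspond to question types, and that low-confidence questions are preferred within each stratum. The function name `select_annotation_subset` and the proportional budget allocation are hypothetical.

```python
import random
from collections import defaultdict

def select_annotation_subset(questions, strata, scores, budget, seed=0):
    """Recommend which questions to annotate under a fixed labeling budget.

    questions -- list of question identifiers
    strata    -- one stratum label per question (e.g., question type)
    scores    -- confidence of a source-domain QA model on each question
    budget    -- total number of questions to annotate (e.g., ~1-2% of pool)
    """
    random.seed(seed)

    # Group candidate questions by stratum.
    by_stratum = defaultdict(list)
    for q, s, c in zip(questions, strata, scores):
        by_stratum[s].append((c, q))

    total = len(questions)
    selected = []
    for items in by_stratum.values():
        # Allocate the budget proportionally to stratum size ...
        k = max(1, round(budget * len(items) / total))
        # ... and, within each stratum, prefer the questions the source-domain
        # model is least confident about (assumed to be the most informative).
        items.sort(key=lambda pair: pair[0])
        selected.extend(q for _, q in items[:k])

    # Trim in random order in case rounding over-allocated across strata.
    random.shuffle(selected)
    return selected[:budget]
```

For example, given a pool of 10,000 unlabeled clinical questions and a budget of 120 (1.2%), this sketch would return roughly the least-confident 1.2% of each question type for annotation.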
