Linguistic Appropriateness and Pedagogic Usefulness of Reading Comprehension Questions

Automatic generation of reading comprehension questions is a topic of growing interest in the NLP community, but there is currently no consensus on evaluation metrics, and many approaches focus only on linguistic quality while ignoring the pedagogic value and appropriateness of questions. This paper addresses these weaknesses with a new evaluation scheme in which the items of the evaluation questionnaire are structured hierarchically, so that human annotators are never confronted with evaluation measures that do not make sense for a given question. We show through an annotation study that our scheme can be applied, but that annotators with some level of expertise are needed. We also create and evaluate two new evaluation data sets from the biology domain for Basque and German, composed of questions written by people with an educational background, which will be publicly released. Results show that manually generated questions are in general of higher linguistic and pedagogic quality than automatically generated ones, and that among the human-generated questions, teacher-generated ones tend to be the most useful.
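
To make the hierarchical gating idea concrete, here is a minimal sketch in Python of a questionnaire in which an evaluation measure is only shown to the annotator when the answers to its parent items make it applicable. The item names, the yes/no answer format, and the particular gating structure below are illustrative assumptions, not the actual scheme used in the paper.

```python
# Minimal sketch of a hierarchically gated evaluation questionnaire.
# All item names and gates are hypothetical; the paper's actual
# questionnaire items and hierarchy may differ.

from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

@dataclass
class Item:
    key: str       # identifier of the evaluation measure
    prompt: str    # text shown to the annotator
    # Gate: given the answers collected so far, is this item applicable?
    applicable: Callable[[Dict[str, str]], bool] = lambda answers: True
    children: List["Item"] = field(default_factory=list)

def annotate(items: List[Item], ask: Callable[[str], str],
             answers: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Walk the hierarchy, skipping subtrees whose gate is not satisfied."""
    answers = {} if answers is None else answers
    for item in items:
        if not item.applicable(answers):
            continue  # this measure makes no sense here, so never show it
        answers[item.key] = ask(item.prompt)
        annotate(item.children, ask, answers)
    return answers

# Hypothetical hierarchy: pedagogic usefulness is only judged once the
# question has been found grammatical and answerable from the text.
questionnaire = [
    Item("grammatical", "Is the question grammatical? (yes/no)", children=[
        Item("answerable", "Is it answerable from the text? (yes/no)",
             applicable=lambda a: a.get("grammatical") == "yes",
             children=[
                 Item("useful", "Is it pedagogically useful? (yes/no)",
                      applicable=lambda a: a.get("answerable") == "yes"),
             ]),
    ]),
]

if __name__ == "__main__":
    print(annotate(questionnaire, ask=input))
```

The gating means an ungrammatical question is never rated for answerability or usefulness, which is the point of the hierarchical design: annotators are spared judgments that are undefined for the question at hand.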
