A Survey on Measuring and Mitigating Reasoning Shortcuts in Machine Reading Comprehension

The issue of shortcut learning is widely known in NLP and has been an important research focus in recent years. Unintended correlations in the data enable models to easily solve tasks that were designed to require advanced language understanding and reasoning capabilities. In this survey paper, we focus on machine reading comprehension (MRC), an important task for showcasing high-level language understanding that also suffers from a range of shortcuts. We summarize the available techniques for measuring and mitigating shortcuts and conclude with suggestions for further progress in shortcut research. Importantly, we highlight two concerns for shortcut mitigation in MRC: (1) the lack of public challenge sets, a necessary component for effective and reusable evaluation, and (2) the absence in MRC of certain mitigation techniques that are prominent in other areas of NLP.
