Enhancing Self-Consistency and Performance of Pre-Trained Language Models through Natural Language Inference

While large pre-trained language models are powerful, their predictions often lack logical consistency across test inputs. For example, a state-of-the-art Macaw question-answering (QA) model answers "Yes" to "Is a sparrow a bird?" and "Does a bird have feet?" but answers "No" to "Does a sparrow have feet?". To address this failure mode, we propose a framework, Consistency Correction through Relation Detection, or ConCoRD, for boosting the consistency and accuracy of pre-trained NLP models using pre-trained natural language inference (NLI) models without fine-tuning or re-training. Given a batch of test inputs, ConCoRD samples several candidate outputs for each input and instantiates a factor graph that accounts for both the model's belief about the likelihood of each answer choice in isolation and the NLI model's beliefs about pairwise answer choice compatibility. We show that a weighted MaxSAT solver can efficiently compute high-quality answer choices under this factor graph, improving over the raw model's predictions. Our experiments demonstrate that ConCoRD consistently boosts accuracy and consistency of off-the-shelf closed-book QA and VQA models using off-the-shelf NLI models, notably increasing accuracy of LXMERT on ConVQA by 5% absolute. See the project website (https://ericmitchell.ai/emnlp-2022-concord/) for code and data.
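To make the inference step concrete, below is a minimal sketch of how the factor graph can be encoded as a weighted MaxSAT problem and solved with the RC2 solver from the python-sat package (pip install python-sat), which is a real library. The abstract does not spell out the exact factor weighting, so the trade-off parameter `beta`, the integer weight scaling, and all function and argument names (`concord_inference`, `answer_probs`, `relations`) are illustrative assumptions, not the paper's implementation.

```python
# Sketch of ConCoRD-style inference as weighted MaxSAT. The weighting scheme
# (`beta`, `scale`) and all names are assumptions; python-sat's RC2 is real.
from pysat.formula import WCNF
from pysat.examples.rc2 import RC2


def concord_inference(answer_probs, relations, beta=0.5, scale=100):
    """answer_probs[q][c]: base model's probability for candidate c of question q.
    relations: {((q1, c1), (q2, c2)): (relation, prob)} from the NLI model,
    where relation is 'entail' or 'contradict'. Returns one candidate per question."""
    wcnf = WCNF()
    var = {}

    def v(q, c):  # one boolean variable per (question, candidate) pair
        return var.setdefault((q, c), len(var) + 1)

    def soft(clause, weight):  # soft clause with a positive integer weight
        w = int(round(scale * weight))
        if w > 0:
            wcnf.append(clause, weight=w)

    for q, cands in enumerate(answer_probs):
        lits = [v(q, c) for c in range(len(cands))]
        wcnf.append(lits)  # hard: at least one candidate per question
        for a in range(len(lits)):
            for b in range(a + 1, len(lits)):
                wcnf.append([-lits[a], -lits[b]])  # hard: at most one
        for c, p in enumerate(cands):
            soft([v(q, c)], beta * p)  # unary factor: base model's confidence

    for ((q1, c1), (q2, c2)), (rel, p) in relations.items():
        if rel == "entail":  # penalize: antecedent chosen, consequent not
            soft([-v(q1, c1), v(q2, c2)], (1 - beta) * p)
        elif rel == "contradict":  # penalize: both chosen
            soft([-v(q1, c1), -v(q2, c2)], (1 - beta) * p)

    with RC2(wcnf) as solver:
        model = set(solver.compute())
    return [next(c for c in range(len(cands)) if var[(q, c)] in model)
            for q, cands in enumerate(answer_probs)]


# Toy run mirroring the sparrow example: the base model slightly prefers the
# wrong "No" for "Does a sparrow have feet?", but an NLI-detected entailment
# from "Is a sparrow a bird? -> Yes" flips it.
probs = [[0.45, 0.55],   # q0: "Does a sparrow have feet?" -> [Yes, No]
         [0.95, 0.05]]   # q1: "Is a sparrow a bird?"      -> [Yes, No]
rels = {((1, 0), (0, 0)): ("entail", 0.9)}
print(concord_inference(probs, rels))  # -> [0, 0], i.e. "Yes" to both
```

Hard clauses force exactly one candidate per question; soft clauses encode the base model's per-candidate confidence and the NLI model's pairwise entailment and contradiction beliefs, so the MaxSAT optimum trades the two sources of evidence off in the way the factor graph prescribes.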
