Calibrated Interpretation: Confidence Estimation in Semantic Parsing

Abstract: Sequence generation models are increasingly used to translate natural language into programs, i.e., to perform executable semantic parsing. Because semantic parsing aims to predict programs that can lead to executed actions in the real world, developing safe systems is essential. This in turn makes measuring calibration, a central component of safety, particularly important. We investigate the calibration of widely used generation models across four popular semantic parsing datasets, finding that it varies across models and datasets. We then analyze factors associated with calibration error and release new confidence-based challenge splits of two parsing datasets. To facilitate the inclusion of calibration in semantic parsing evaluations, we release a library for computing calibration metrics.
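To make "calibration error" concrete, the sketch below computes expected calibration error (ECE), one commonly used calibration metric. The abstract does not say which metrics the released library implements, so this is an illustration only and not that library's API. The function name `expected_calibration_error`, its parameters, and the choice of ten equal-width bins are assumptions made for this example.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """Illustrative ECE with equal-width confidence bins (hypothetical helper).

    confidences: model confidence in [0, 1] for each prediction.
    correct: whether each prediction was actually right.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        return 0.0

    conf_sum = [0.0] * n_bins   # total confidence per bin
    hit_sum = [0.0] * n_bins    # number of correct predictions per bin
    count = [0] * n_bins        # number of predictions per bin

    for c, y in zip(confidences, correct):
        # A confidence of exactly 1.0 is placed in the last bin.
        idx = min(int(c * n_bins), n_bins - 1)
        conf_sum[idx] += c
        hit_sum[idx] += 1.0 if y else 0.0
        count[idx] += 1

    ece = 0.0
    for b in range(n_bins):
        if count[b] == 0:
            continue
        avg_conf = conf_sum[b] / count[b]
        accuracy = hit_sum[b] / count[b]
        # Weight each bin's |accuracy - confidence| gap by its share of the data.
        ece += (count[b] / n) * abs(accuracy - avg_conf)
    return ece
```

A model that reports 0.9 confidence on five predictions and gets four of them right has a gap of about 0.1 between confidence and accuracy; a perfectly calibrated model would score 0.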
