Posterior calibration and exploratory analysis for natural language processing models

Many models in natural language processing define probability distributions over linguistic structures. We argue that (1) the quality of a model's posterior distribution can and should be evaluated directly, by checking whether its predicted probabilities correspond to empirical frequencies, and (2) NLP uncertainty can be propagated not only to downstream pipeline components but also to exploratory data analysis, telling a user when to trust the NLP analysis and when not to. We present a method for analyzing calibration and apply it to compare the miscalibration of several commonly used models. We also contribute a coreference sampling algorithm that can create confidence intervals for a political event extraction task.
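The abstract does not spell out the calibration analysis itself, so the following is only a minimal sketch of one common way to check whether probabilities correspond to empirical frequencies: group predictions into equal-width probability bins and compare each bin's mean predicted probability with the observed fraction of positive labels. The function names (`calibration_table`, `calibration_error`), the choice of ten bins, and the root-mean-square summary are assumptions made for illustration, not necessarily the procedure used in the paper.

```python
import numpy as np


def calibration_table(probs, labels, n_bins=10):
    """Bin predictions and compare mean predicted probability to empirical frequency.

    Hypothetical helper for illustration. Returns one row per non-empty bin:
    (bin_index, count, mean_predicted_probability, empirical_frequency).
    """
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Digitizing against the interior edges yields bin indices 0..n_bins-1,
    # so a prediction of exactly 1.0 falls into the last bin.
    idx = np.digitize(probs, edges[1:-1])
    rows = []
    for b in range(n_bins):
        mask = idx == b
        if not mask.any():
            continue  # skip empty bins
        rows.append((b, int(mask.sum()), probs[mask].mean(), labels[mask].mean()))
    return rows


def calibration_error(probs, labels, n_bins=10):
    """Count-weighted root-mean-square gap between predicted and empirical frequency."""
    rows = calibration_table(probs, labels, n_bins)
    total = sum(count for _, count, _, _ in rows)
    sq = sum(count * (mean_p - freq) ** 2 for _, count, mean_p, freq in rows)
    return float(np.sqrt(sq / total))


if __name__ == "__main__":
    # Toy data: the model says 0.85 four times but is right only once.
    toy_probs = [0.85, 0.85, 0.85, 0.85]
    toy_labels = [1, 0, 0, 0]
    for row in calibration_table(toy_probs, toy_labels):
        print(row)
    print(calibration_error(toy_probs, toy_labels))
```

A well-calibrated model gives rows whose last two columns are close to each other, and an error near zero. The bin count trades resolution against noise in each bin's frequency estimate, so it is worth varying on real data.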
