DQI: A Guide to Benchmark Evaluation

A "state-of-the-art" model A surpasses humans on a benchmark B, yet fails on similar benchmarks C, D, and E. What does B have that the other benchmarks do not? Recent research provides the answer: spurious bias. However, developing A to solve benchmarks B through E does not guarantee that it will solve future benchmarks. To progress towards a model that "truly learns" an underlying task, we need to quantify the differences between successive benchmarks, rather than relying on existing binary, black-box approaches. We propose a novel way to address this underexplored task of quantifying benchmark quality by introducing a data quality metric: DQI.
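To make the notion of spurious bias concrete, the Python sketch below ranks tokens by pointwise mutual information (PMI) with the gold labels of a dataset; high-scoring tokens are candidate spurious cues of the kind that let shallow, input-partial baselines succeed. This is an illustrative sketch under our own assumptions (the function name, the min_count threshold, and whitespace tokenization are ours), not the DQI formula itself.

import math
from collections import Counter

def rank_spurious_cues(examples, min_count=5):
    # examples: list of (text, label) pairs.
    # Scores each (token, label) pair by pointwise mutual information,
    # counting a token at most once per example. High-PMI pairs flag
    # candidate annotation artifacts; this is a rough proxy for one
    # facet of data quality, not the DQI metric itself.
    token_label = Counter()
    token_count = Counter()
    label_count = Counter()
    total = 0
    for text, label in examples:
        for tok in set(text.lower().split()):
            token_label[(tok, label)] += 1
            token_count[tok] += 1
        label_count[label] += 1
        total += 1
    scores = {}
    for (tok, label), n in token_label.items():
        if token_count[tok] < min_count:
            continue  # skip rare tokens with unreliable statistics
        p_joint = n / total
        p_tok = token_count[tok] / total
        p_label = label_count[label] / total
        scores[(tok, label)] = math.log(p_joint / (p_tok * p_label))
    return sorted(scores.items(), key=lambda kv: -kv[1])

Run on the hypotheses of an NLI benchmark, for example, negation words scoring high for the contradiction label would indicate exactly the kind of artifact that inflates benchmark-specific performance without reflecting true task learning.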
