DQI: A Guide to Benchmark Evaluation

A "state-of-the-art" model A surpasses humans on a benchmark B, yet fails on similar benchmarks C, D, and E. What does B have that the other benchmarks do not? Recent research provides the answer: spurious bias. However, developing A to solve benchmarks B through E does not guarantee that it will solve future benchmarks. To progress towards a model that "truly learns" an underlying task, we need to quantify the differences between successive benchmarks, rather than relying on existing binary, black-box approaches. We propose a novel way to address this underexplored task of quantifying benchmark quality by introducing a data quality metric: DQI.
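To make the notion of spurious bias concrete, the Python sketch below ranks tokens by pointwise mutual information (PMI) with the gold labels of a dataset; high-scoring tokens are candidate spurious cues of the kind that let shallow, input-partial baselines succeed. This is an illustrative sketch under our own assumptions (the function name, the min_count threshold, and whitespace tokenization are ours), not the DQI formula itself.

import math
from collections import Counter

def rank_spurious_cues(examples, min_count=5):
    # examples: list of (text, label) pairs.
    # Scores each (token, label) pair by pointwise mutual information,
    # counting a token at most once per example. High-PMI pairs flag
    # candidate annotation artifacts; this is a rough proxy for one
    # facet of data quality, not the DQI metric itself.
    token_label = Counter()
    token_count = Counter()
    label_count = Counter()
    total = 0
    for text, label in examples:
        for tok in set(text.lower().split()):
            token_label[(tok, label)] += 1
            token_count[tok] += 1
        label_count[label] += 1
        total += 1
    scores = {}
    for (tok, label), n in token_label.items():
        if token_count[tok] < min_count:
            continue  # skip rare tokens with unreliable statistics
        p_joint = n / total
        p_tok = token_count[tok] / total
        p_label = label_count[label] / total
        scores[(tok, label)] = math.log(p_joint / (p_tok * p_label))
    return sorted(scores.items(), key=lambda kv: -kv[1])

Run on the hypotheses of an NLI benchmark, for example, negation words scoring high for the contradiction label would indicate exactly the kind of artifact that inflates benchmark-specific performance without reflecting true task learning.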
