Are All the Datasets in Benchmark Necessary? A Pilot Study of Dataset Evaluation for Text Classification

In this paper, we ask whether all the datasets in a benchmark are necessary. We approach this question by first characterizing how well each dataset distinguishes between the systems being compared. Experiments on 9 datasets and 36 systems show that several existing benchmark datasets contribute little to discriminating top-scoring systems, while some less-used datasets exhibit strong discriminative power. Taking text classification as a case study, we further investigate whether a dataset's discrimination can be predicted from its properties (e.g., average sentence length). Our preliminary experiments show that, given a sufficient number of training experimental records, a meaningful predictor can be learned to estimate dataset discrimination on unseen datasets. We have released all datasets, together with the features explored in this work, on DataLab.
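To make the idea of dataset discrimination concrete, the following is a minimal sketch under stated assumptions. The abstract does not define the measure used in the paper, so the proxy here (the spread of the top-k system scores on a dataset) is an illustrative stand-in, and the function name `top_k_spread`, the dataset names, and all scores are hypothetical.

```python
from statistics import pstdev


def top_k_spread(scores, k=3):
    """Population standard deviation of the k highest scores on one dataset.

    Hypothetical proxy for discrimination: a larger spread among the
    top-scoring systems means the dataset separates them more clearly.
    This is NOT claimed to be the measure defined in the paper.
    """
    top = sorted(scores, reverse=True)[:k]
    return pstdev(top)


# Made-up accuracies of five systems on two made-up datasets.
results = {
    "dataset_a": [0.95, 0.95, 0.95, 0.80, 0.70],  # top systems are tied
    "dataset_b": [0.90, 0.80, 0.70, 0.60, 0.50],  # top systems are spread out
}

# Score each dataset, then rank datasets from most to least discriminative.
spread = {name: top_k_spread(s) for name, s in results.items()}
ranking = sorted(spread, key=spread.get, reverse=True)

for name in ranking:
    print(f"{name}: top-3 spread = {spread[name]:.4f}")
```

Under this proxy, a dataset on which the best systems are tied contributes nothing to telling them apart, which mirrors the abstract's observation about top-scoring systems. A predictor of the kind the abstract describes would then take dataset properties such as average sentence length as input and a score of this sort as the regression target.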
