How Robust are Model Rankings: A Leaderboard Customization Approach for Equitable Evaluation

Models that top leaderboards often perform unsatisfactorily when deployed in real-world applications; this has necessitated rigorous and expensive pre-deployment model testing. A hitherto unexplored facet of model performance is: are our leaderboards doing equitable evaluation? In this paper, we introduce a task-agnostic method to probe leaderboards by weighting samples based on their ‘difficulty’ level. We find that leaderboards can be adversarially attacked, and that top-performing models may not always be the best models. We subsequently propose alternate evaluation metrics. Our experiments on 10 models show changes in model ranking and an overall reduction in previously reported performance, thus rectifying the overestimation of AI systems’ capabilities. Inspired by behavioral testing principles, we further develop a prototype of a visual analytics tool that enables leaderboard revamping through customization based on an end user’s focus area. This helps users analyze models’ strengths and weaknesses and guides them in selecting the model best suited to their application scenario. In a user study, members of various commercial product development teams, covering 5 focus areas, found that our prototype reduces pre-deployment development and testing effort by 41% on average.
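
As a rough illustration of the difficulty-weighted scoring described above, the sketch below computes an accuracy in which each sample contributes in proportion to a per-sample difficulty score. The `weighted_accuracy` function and the example difficulty values are illustrative assumptions, not the paper's exact metric or weighting scheme.

```python
# Minimal sketch of difficulty-weighted evaluation (hypothetical weighting;
# the paper's exact scheme is not specified in this abstract).
import numpy as np

def weighted_accuracy(predictions, labels, difficulty):
    """Accuracy in which each sample contributes proportionally to its difficulty.

    predictions, labels : arrays of model outputs and gold labels
    difficulty          : per-sample difficulty scores in [0, 1] (assumed given)
    """
    predictions = np.asarray(predictions)
    labels = np.asarray(labels)
    weights = np.asarray(difficulty, dtype=float)
    correct = (predictions == labels).astype(float)
    return float(np.sum(weights * correct) / np.sum(weights))

# Example: a model that only solves the easy samples scores far lower under
# difficulty weighting than under plain accuracy.
preds = [1, 0, 1, 1]
golds = [1, 0, 0, 0]
diffs = [0.1, 0.2, 0.9, 0.8]  # higher = harder (hypothetical scores)
print(weighted_accuracy(preds, golds, diffs))  # ~0.15, versus plain accuracy of 0.5
```

Under such a weighting, two models with identical unweighted accuracy can receive different scores, which is what allows leaderboard rankings to shift once sample difficulty is taken into account.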
