What are the best systems? New perspectives on NLP Benchmarking

In Machine Learning, a benchmark refers to an ensemble of datasets associated with one or multiple metrics, together with a way to aggregate different systems' performances. Benchmarks are instrumental in (i) assessing the progress of new methods along different axes and (ii) selecting the best systems for practical use. This is particularly the case for NLP with the development of large pre-trained models (e.g. GPT, BERT) that are expected to generalize well on a variety of tasks. While the community has mainly focused on developing new datasets and metrics, the aggregation procedure has received little attention and is often reduced to a simple average over various performance measures. However, this procedure can be problematic when the metrics are on different scales, which may lead to spurious conclusions. This paper proposes a new procedure to rank systems based on their performance across different tasks. Motivated by social choice theory, the final system ordering is obtained by aggregating the rankings induced by each task, and it is theoretically grounded. We conduct extensive numerical experiments (on over 270k scores) to assess the soundness of our approach on both synthetic and real scores (e.g. GLUE, EXTREM, SEVAL, TAC, FLICKR). In particular, we show that our method yields different conclusions on state-of-the-art systems than the mean-aggregation procedure while being both more reliable and robust.
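
To make the contrast concrete, here is a minimal sketch comparing mean aggregation with one simple rank-based aggregation rule (a Borda count over per-task rankings). Both the aggregation rule and the scores below are illustrative assumptions, not the exact procedure or data used in the paper.

```python
import numpy as np

# Hypothetical scores: rows = systems, columns = tasks.
# The third task is on a much larger scale than the first two,
# so it dominates a plain average of raw scores.
scores = np.array([
    [0.71, 0.68, 35.0],   # system A
    [0.69, 0.70, 42.0],   # system B
    [0.75, 0.74, 30.0],   # system C
])
systems = ["A", "B", "C"]

# Mean aggregation: average the raw scores across tasks.
mean_agg = scores.mean(axis=1)

# Rank-based aggregation (Borda count): each task ranks the systems,
# and a system's final score is the sum of its per-task ranks
# (0 = worst on that task, n-1 = best).
ranks = scores.argsort(axis=0).argsort(axis=0)
borda = ranks.sum(axis=1)

print("Mean aggregation winner: ", systems[int(mean_agg.argmax())])
print("Borda aggregation winner:", systems[int(borda.argmax())])
```

On this toy example the third task's larger scale dominates the raw average, so mean aggregation and the rank-based rule disagree on which system wins; this scale sensitivity is precisely the failure mode that a ranking-based aggregation procedure is designed to avoid.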
