What are the best systems? New perspectives on NLP Benchmarking
[1] P. Piantanida,et al. InfoLM: A New Metric to Evaluate Summarization & Data2Text Generation , 2021, AAAI.
[2] Sanket Vaibhav Mehta,et al. ExT5: Towards Extreme Multi-Task Scaling for Transfer Learning , 2021, ArXiv.
[3] Alexander M. Rush,et al. Multitask Prompted Training Enables Zero-Shot Task Generalization , 2021, ICLR.
[4] R. Salakhutdinov,et al. FewNLU: Benchmarking State-of-the-Art Methods for Few-Shot Natural Language Understanding , 2021, ACL.
[5] Quoc V. Le,et al. Finetuned Language Models Are Zero-Shot Learners , 2021, ICLR.
[6] Gerard de Melo,et al. NL-Augmenter: A Framework for Task-Sensitive Natural Language Augmentation , 2021, Northern European Journal of Language Technology.
[7] Wei Zhao,et al. Better than Average: Paired Evaluation of NLP systems , 2021, ACL.
[8] Giovanna Varni,et al. Beam Search with Bidirectional Strategies for Neural Response Generation , 2021, ICNLSP.
[9] Matthieu Labeau,et al. Improving Multimodal fusion via Mutual Dependency Maximisation , 2021, EMNLP.
[10] Pablo Piantanida,et al. Automatic Text Evaluation through the Lens of Wasserstein Barycenters , 2021, EMNLP.
[11] Sebastian Gehrmann,et al. Reusable Templates and Guides For Documenting Datasets and Models for Natural Language Processing and Generation: A Case Study of the HuggingFace and GEM Data and Model Cards , 2021, GEM.
[12] Fernando Diaz,et al. The Benchmark Lottery , 2021, ArXiv.
[13] Weizhe Yuan,et al. BARTScore: Evaluating Generated Text as Text Generation , 2021, NeurIPS.
[14] Constantin Orasan,et al. An Exploratory Analysis of Multilingual Word-Level Quality Estimation with Cross-Lingual Transformers , 2021, ACL.
[15] Dan Klein,et al. Are Larger Pretrained Language Models Uniformly Better? Comparing Performance at the Instance Level , 2021, FINDINGS.
[16] C. Clavel,et al. A Novel Estimator of Mutual Information for Learning to Disentangle Textual Representations , 2021, ACL.
[17] Graham Neubig,et al. ExplainaBoard: An Explainable Leaderboard for NLP , 2021, ACL.
[18] Diyi Yang,et al. The GEM Benchmark: Natural Language Generation, its Evaluation and Metrics , 2021, GEM.
[19] Liu Yang,et al. Long Range Arena: A Benchmark for Efficient Transformers , 2020, ICLR.
[20] Dragomir R. Radev,et al. SummEval: Re-evaluating Summarization Evaluation , 2020, Transactions of the Association for Computational Linguistics.
[21] Pierre Colombo. Apprendre à représenter et à générer du texte en utilisant des mesures d'information. (Learning to represent and generate text using information measures) , 2021.
[22] R. Busa-Fekete,et al. Private and Non-private Uniformity Testing for Ranking Data , 2021, NeurIPS.
[23] Stéphan Clémençon,et al. Depth-based pseudo-metrics between probability distributions , 2021, ArXiv.
[24] Graham Neubig,et al. Re-evaluating Evaluation in Text Summarization , 2020, EMNLP.
[25] Matthieu Labeau,et al. The Importance of Fillers for Text Representations of Speech Transcripts , 2020, EMNLP.
[26] Matthieu Labeau,et al. Hierarchical Pre-training for Sequence Labelling in Spoken Dialog , 2020, FINDINGS.
[27] Christopher Joseph Pal,et al. Would you Rather? A New Benchmark for Learning Machine Alignment with Cultural Values and Social Preferences , 2020, ACL.
[28] Maxine Eskenazi,et al. USR: An Unsupervised and Reference Free Evaluation Metric for Dialog Generation , 2020, ACL.
[29] Orhan Firat,et al. XTREME: A Massively Multilingual Multi-task Benchmark for Evaluating Cross-lingual Generalization , 2020, ICML.
[30] Eunsol Choi,et al. TyDi QA: A Benchmark for Information-Seeking Question Answering in Typologically Diverse Languages , 2020, Transactions of the Association for Computational Linguistics.
[31] Matteo Manica,et al. Guiding attention in Sequence-to-sequence models for Dialogue Act prediction , 2020, AAAI.
[32] Mikel Artetxe,et al. On the Cross-lingual Transferability of Monolingual Representations , 2019, ACL.
[33] Holger Schwenk,et al. MLQA: Evaluating Cross-lingual Extractive Question Answering , 2019, ACL.
[34] Kilian Q. Weinberger,et al. BERTScore: Evaluating Text Generation with BERT , 2019, ICLR.
[35] Chloé Clavel,et al. From the Token to the Review: A Hierarchical Multimodal approach to Opinion Mining , 2019, EMNLP/IJCNLP.
[36] Fei Liu,et al. MoverScore: Text Generation Evaluating with Contextualized Embeddings and Earth Mover Distance , 2019, EMNLP.
[37] Jason Baldridge,et al. PAWS-X: A Cross-lingual Adversarial Dataset for Paraphrase Identification , 2019, EMNLP.
[38] Judith Tonhauser,et al. The CommitmentBank: Investigating projection in naturally occurring discourse , 2019.
[39] Omer Levy,et al. SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems , 2019, NeurIPS.
[40] Ming-Wei Chang,et al. BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions , 2019, NAACL.
[41] James Kennedy,et al. Affect-Driven Dialog Generation , 2019, NAACL.
[42] Jason Baldridge,et al. PAWS: Paraphrase Adversaries from Word Scrambling , 2019, NAACL.
[43] Trevor Cohn,et al. Massively Multilingual Transfer for NER , 2019, ACL.
[44] Holger Schwenk,et al. Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer and Beyond , 2018, Transactions of the Association for Computational Linguistics.
[45] José Camacho-Collados,et al. WiC: the Word-in-Context Dataset for Evaluating Context-Sensitive Meaning Representations , 2018, NAACL.
[46] Samuel R. Bowman,et al. Neural Network Acceptability Judgments , 2018, Transactions of the Association for Computational Linguistics.
[47] Pan He,et al. Adversarial Examples: Attacks and Defenses for Deep Learning , 2017, IEEE Transactions on Neural Networks and Learning Systems.
[48] Junfeng Hu,et al. Meteor++ 2.0: Adopt Syntactic Level Paraphrase Knowledge into Machine Translation Evaluation , 2019, WMT.
[49] Xiaodong Liu,et al. ReCoRD: Bridging the Gap between Human and Machine Commonsense Reading Comprehension , 2018, ArXiv.
[50] Ashutosh Modi,et al. Disney at IEST 2018: Predicting Emotions using an Ensemble , 2018, WASSA@EMNLP.
[51] Guillaume Lample,et al. XNLI: Evaluating Cross-lingual Sentence Representations , 2018, EMNLP.
[52] Richard Socher,et al. The Natural Language Decathlon: Multitask Learning as Question Answering , 2018, ArXiv.
[53] Dan Roth,et al. Looking Beyond the Surface: A Challenge Set for Reading Comprehension over Multiple Sentences , 2018, NAACL.
[54] Guillaume Lample,et al. What you can cram into a single $&!#* vector: Probing sentence embeddings for linguistic properties , 2018, ACL.
[55] Matt Post,et al. A Call for Clarity in Reporting BLEU Scores , 2018, WMT.
[56] Omer Levy,et al. GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding , 2018, BlackboxNLP@EMNLP.
[57] Samuel R. Bowman,et al. A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference , 2017, NAACL.
[58] Leslie Pack Kaelbling,et al. Generalization in Deep Learning , 2017, ArXiv.
[59] Iryna Gurevych,et al. Learning to Score System Summaries for Better Content Selection Evaluation , 2017, NFiS@EMNLP.
[60] Shashi Narayan,et al. Creating Training Corpora for NLG Micro-Planners , 2017, ACL.
[61] Pierre Zweigenbaum,et al. Overview of the Second BUCC Shared Task: Spotting Parallel Sentences in Comparable Corpora , 2017, BUCC@ACL.
[62] Eneko Agirre,et al. SemEval-2017 Task 1: Semantic Textual Similarity Multilingual and Crosslingual Focused Evaluation , 2017, SemEval.
[63] Stéphan Clémençon,et al. A Learning Theory of Ranking Aggregation , 2017, AISTATS.
[64] Maja Popovic,et al. chrF++: words helping character n-grams , 2017, WMT.
[65] Heng Ji,et al. Cross-lingual Name Tagging and Linking for 282 Languages , 2017, ACL.
[66] Jian Zhang,et al. SQuAD: 100,000+ Questions for Machine Comprehension of Text , 2016, EMNLP.
[67] Demis Hassabis,et al. Mastering the game of Go with deep neural networks and tree search , 2016, Nature.
[68] Maja Popovic,et al. chrF: character n-gram F-score for automatic MT evaluation , 2015, WMT@EMNLP.
[69] Jun-Ping Ng,et al. Better Summarization Evaluation with Word Embeddings for ROUGE , 2015, EMNLP.
[70] Alon Lavie,et al. Meteor Universal: Language Specific Translation Evaluation for Any Target Language , 2014, WMT@ACL.
[71] Peter Young,et al. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions , 2014, TACL.
[72] Christopher Potts,et al. Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank , 2013, EMNLP.
[73] R. Myerson. Fundamentals of Social Choice Theory , 2013.
[74] Ariel D. Procaccia,et al. When do noisy votes reveal the truth? , 2013, EC '13.
[75] Marina Meila,et al. Experiments with Kemeny ranking: What works when? , 2012, Math. Soc. Sci.
[76] Zornitsa Kozareva,et al. SemEval-2012 Task 7: Choice of Plausible Alternatives: An Evaluation of Commonsense Causal Reasoning , 2011, SemEval.
[77] Hector J. Levesque,et al. The Winograd Schema Challenge , 2011, AAAI Spring Symposium: Logical Formalizations of Commonsense Reasoning.
[78] Edith Hemaspaandra,et al. Bypassing Combinatorial Protections: Polynomial-Time Algorithms for Single-Peaked Electorates , 2010, AAAI.
[79] Rolf Niedermeier,et al. How similarity helps to efficiently compute Kemeny rankings , 2009, AAMAS.
[80] Nicolas de Condorcet. Essai Sur L'Application de L'Analyse à la Probabilité Des Décisions Rendues à la Pluralité Des Voix , 2009.
[81] Ido Dagan,et al. The Sixth PASCAL Recognizing Textual Entailment Challenge , 2009, TAC.
[82] Csaba Szepesvári,et al. Empirical Bernstein stopping , 2008, ICML '08.
[83] Hoa Trang Dang,et al. Overview of the TAC 2008 Update Summarization Task , 2008, TAC.
[84] Ido Dagan,et al. The Third PASCAL Recognizing Textual Entailment Challenge , 2007, ACL-PASCAL@ACL.
[85] Jianfeng Gao,et al. An Information-Theoretic Approach to Automatic Evaluation of Summaries , 2006, NAACL.
[86] Atri Rudra,et al. Ordering by weighted number of wins gives a good ranking for weighted tournaments , 2006, SODA '06.
[88] Alon Lavie,et al. METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments , 2005, IEEvaluation@ACL.
[89] Jean Carletta,et al. Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization , 2005, ACL 2005.
[90] Chris Brockett,et al. Automatically Constructing a Corpus of Sentential Paraphrases , 2005, IJCNLP.
[91] Chin-Yew Lin,et al. ROUGE: A Package for Automatic Evaluation of Summaries , 2004, ACL 2004.
[92] Ronald Fagin,et al. Comparing top k lists , 2003, SODA '03.
[93] Salim Roukos,et al. Bleu: a Method for Automatic Evaluation of Machine Translation , 2002, ACL.
[94] Moni Naor,et al. Rank aggregation methods for the Web , 2001, WWW '01.
[95] S. Shapiro,et al. Mathematics without Numbers , 1993.
[96] M. Trick,et al. The computational difficulty of manipulating an election , 1989.
[97] H. Young. Condorcet's Theory of Voting , 1988, American Political Science Review.
[98] M. Fligner,et al. Multistage Ranking Models , 1988.
[99] M. Fligner,et al. Distance Based Ranking Models , 1986.
[100] R. Duncan Luce,et al. Individual Choice Behavior: A Theoretical Analysis , 1979.
[101] H. Young,et al. A Consistent Extension of Condorcet’s Election Principle , 1978.
[102] R. Plackett. The Analysis of Permutations , 1975.
[104] M. Kendall. The treatment of ties in ranking problems , 1945, Biometrika.
[105] M. Kendall. A New Measure of Rank Correlation , 1938.