Rissanen Data Analysis: Examining Dataset Characteristics via Description Length

We introduce a method to determine whether a certain capability helps a model fit given data accurately. We view labels as generated from the inputs by a program composed of subroutines with different capabilities, and we posit that a subroutine is useful if and only if the minimal program that invokes it is shorter than one that does not. Since minimum program length is uncomputable, we instead estimate the labels' minimum description length (MDL) as a proxy, giving a theoretically grounded method for analyzing dataset characteristics. We call the method Rissanen Data Analysis (RDA) after the father of MDL, and we showcase its applicability across a wide variety of NLP settings: evaluating the utility of generating subquestions before answering a question, analyzing the value of rationales and explanations, investigating the importance of different parts of speech, and uncovering dataset gender bias.
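
To make the MDL estimation step concrete, below is a minimal sketch of the prequential (online) coding scheme commonly used to approximate description length: each block of labels is encoded using a model trained only on the examples that precede it, so the codelength reflects both model fit and how quickly the model learns. The scikit-learn-style fit/predict_proba interface, the model_factory argument, and the block boundaries are illustrative assumptions, not the paper's exact implementation.

    import numpy as np

    def prequential_codelength(model_factory, X, y, block_starts, num_classes):
        """Prequential (online) estimate, in bits, of the description length
        of labels y given inputs X: each block of examples is encoded with a
        model trained only on the examples that precede it."""
        y = np.asarray(y)
        total_bits = 0.0
        boundaries = list(block_starts) + [len(y)]
        for start, end in zip(boundaries[:-1], boundaries[1:]):
            if start == 0:
                # No training data yet: encode the first block uniformly.
                total_bits += (end - start) * np.log2(num_classes)
                continue
            model = model_factory()
            model.fit(X[:start], y[:start])
            probs = model.predict_proba(X[start:end])
            # Bits to transmit the true labels under the model's predictive
            # distribution (assumes integer labels 0..num_classes-1).
            picked = probs[np.arange(end - start), y[start:end]]
            total_bits += float(-np.log2(picked + 1e-12).sum())
        return total_bits

Under this scheme, a capability (e.g., generated subquestions appended to the input) would be judged useful if the codelength of the labels given the capability-augmented inputs is shorter than the codelength given the raw inputs alone.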
