Sensitivity as a Complexity Measure for Sequence Classification Tasks

Abstract

We introduce a theoretical framework for understanding and predicting the complexity of sequence classification tasks, based on a novel extension of the theory of Boolean function sensitivity. The sensitivity of a function, given a distribution over input sequences, quantifies the number of disjoint subsets of the input sequence that can each be individually changed so as to change the output. We argue that standard sequence classification methods are biased towards learning low-sensitivity functions, so that tasks requiring high sensitivity are more difficult. In support of this claim, we show analytically that simple lexical classifiers can only express functions of bounded sensitivity, and we show empirically that low-sensitivity functions are easier for LSTMs to learn. We then estimate sensitivity on 15 NLP tasks, finding that sensitivity is higher on the challenging tasks collected in GLUE than on simple text classification tasks, and that sensitivity predicts the performance both of simple lexical classifiers and of vanilla BiLSTMs without pretrained contextualized embeddings. Within a task, sensitivity predicts which inputs are hard for such simple models. Our results suggest that massively pretrained contextual representations succeed in part because they provide representations from which information can be extracted by low-sensitivity decoders.
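The quantity defined above corresponds, in the Boolean setting, to block sensitivity: the largest number of disjoint blocks of input positions that can each be flipped to change the function's output. As a minimal, self-contained sketch of that idea (Boolean inputs only; the paper's measure additionally averages over a distribution of natural-language substitutions, which this toy version does not attempt to model), the following illustrative Python code computes block sensitivity exactly for small inputs. The helper names (`block_sensitivity`, `parity`, `contains_one`) are hypothetical, not from the paper:

```python
from itertools import combinations

def block_sensitivity(f, x):
    """Exact block sensitivity of a Boolean function f at input x.

    A block (set of positions) is "sensitive" at x if flipping all of
    its bits changes f(x); we return the size of the largest family of
    pairwise-disjoint sensitive blocks. Brute force: exponential in
    len(x), so toy input sizes only.
    """
    n, base = len(x), f(x)
    # Enumerate every sensitive block.
    sensitive = []
    for size in range(1, n + 1):
        for block in combinations(range(n), size):
            y = list(x)
            for i in block:
                y[i] ^= 1
            if f(tuple(y)) != base:
                sensitive.append(frozenset(block))

    # Brute-force search for the largest pairwise-disjoint subfamily.
    def best(blocks, used):
        top = 0
        for k, b in enumerate(blocks):
            if b.isdisjoint(used):
                top = max(top, 1 + best(blocks[k + 1:], used | b))
        return top

    return best(sensitive, frozenset())

# Parity is maximally sensitive: each position is its own sensitive block.
parity = lambda x: sum(x) % 2
print(block_sensitivity(parity, (0, 1, 1, 0)))        # 4

# A bag-of-words-style detector ("does a 1 occur?") has low sensitivity
# here: only the block {0, 1} (both 1s) flips the output.
contains_one = lambda x: int(1 in x)
print(block_sensitivity(contains_one, (1, 1, 0, 0)))  # 1
```

The contrast between the two toy functions mirrors the abstract's claim: a parity-like task requires high sensitivity, whereas a lexical detector realizes only a bounded-sensitivity function.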
