Language Modelling as a Multi-Task Problem

In this paper, we propose to study language modelling as a multi-task problem, bringing together three strands of research: multi-task learning, linguistics, and interpretability. Based on hypotheses derived from linguistic theory, we investigate whether language models adhere to principles of multi-task learning during training. To showcase the idea, we analyse the generalisation behaviour of language models as they learn the linguistic concept of Negative Polarity Items (NPIs). Our experiments demonstrate that a multi-task setting naturally emerges within the objective of the more general task of language modelling. We argue that this insight is valuable for multi-task learning, linguistics, and interpretability research, and can lead to exciting new findings in all three domains.
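To make the evaluation setup concrete, the sketch below illustrates one common way of probing a model's knowledge of NPI licensing: comparing the likelihood a language model assigns to a sentence in which the NPI "ever" is licensed by a negative context against a minimally different, unlicensed variant. This is a minimal illustration only, not the paper's own experimental code: it assumes an off-the-shelf HuggingFace GPT-2 model as a stand-in for the language models under study, and the sentence pair is our own.

```python
# Minimal sketch: targeted evaluation of NPI licensing with a causal LM.
# Assumes the HuggingFace `transformers` library and GPT-2 as a stand-in model.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def sentence_log_prob(sentence: str) -> float:
    """Sum of per-token log-probabilities of `sentence` under the model."""
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    # Shift so that the logits at each position predict the next token.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    targets = ids[0, 1:]
    return log_probs[torch.arange(targets.size(0)), targets].sum().item()

# "ever" is licensed by the negative quantifier "nobody", but not by "somebody".
licensed = "Nobody has ever seen such a thing."
unlicensed = "Somebody has ever seen such a thing."
print(sentence_log_prob(licensed) > sentence_log_prob(unlicensed))
```

A model that has acquired the licensing relation should assign higher probability to the licensed sentence; aggregating such comparisons over many minimal pairs yields the kind of accuracy score used in targeted syntactic evaluation, and tracking it over the course of training is one way to expose the multi-task dynamics described above.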
