Universal Language Model Fine-tuning for Text Classification

Inductive transfer learning has greatly impacted computer vision, but existing approaches in NLP still require task-specific modifications and training from scratch. We propose Universal Language Model Fine-tuning (ULMFiT), an effective transfer learning method that can be applied to any task in NLP, and introduce techniques that are key for fine-tuning a language model. Our method significantly outperforms the state-of-the-art on six text classification tasks, reducing the error by 18-24% on the majority of datasets. Furthermore, with only 100 labeled examples, it matches the performance of training from scratch on 100 times more data. We open-source our pretrained models and code.
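
The techniques the abstract alludes to are the paper's discriminative fine-tuning, slanted triangular learning rates (STLR), and gradual unfreezing. The sketch below is a minimal PyTorch illustration of all three, not the authors' released code: the stand-in model, layer sizes, and optimizer choice are assumptions made here for brevity, while the STLR formula and the defaults (cut_frac = 0.1, ratio = 32, a 2.6x per-layer rate decay) follow the paper.

    import torch
    import torch.nn as nn

    # Hypothetical three-group network standing in for the paper's
    # AWD-LSTM classifier; the layers and sizes are illustrative only.
    class Net(nn.Module):
        def __init__(self, n_classes=2):
            super().__init__()
            self.groups = nn.ModuleList([
                nn.Sequential(nn.Linear(400, 1150), nn.ReLU()),   # lowest group
                nn.Sequential(nn.Linear(1150, 1150), nn.ReLU()),  # middle group
                nn.Linear(1150, n_classes),                       # classifier head
            ])

        def forward(self, x):
            for group in self.groups:
                x = group(x)
            return x

    def stlr(t, T, cut_frac=0.1, ratio=32, lr_max=0.01):
        # Slanted triangular learning rate: a short linear warm-up over
        # the first cut_frac of the T updates, then a long linear decay
        # back down to lr_max / ratio, as in the paper's schedule.
        cut = max(1, int(T * cut_frac))
        if t < cut:
            p = t / cut
        else:
            p = 1 - (t - cut) / (cut * (1 / cut_frac - 1))
        return lr_max * (1 + p * (ratio - 1)) / ratio

    def make_optimizer(model, lr_max=0.01, decay=2.6):
        # Discriminative fine-tuning: each layer group gets its own
        # learning rate, reduced by a factor of 2.6 per group going
        # from the classifier head down toward the embeddings.
        n = len(model.groups)
        param_groups = [
            {"params": g.parameters(), "lr": lr_max / decay ** (n - 1 - i)}
            for i, g in enumerate(model.groups)
        ]
        return torch.optim.Adam(param_groups)  # optimizer choice is illustrative

    def unfreeze_schedule(model, epoch):
        # Gradual unfreezing: freeze everything, then unfreeze the last
        # (epoch + 1) layer groups, so the classifier head thaws first.
        for p in model.parameters():
            p.requires_grad_(False)
        for group in model.groups[-(epoch + 1):]:
            for p in group.parameters():
                p.requires_grad_(True)

In a training loop, one would call unfreeze_schedule(model, epoch) at the start of each epoch and rescale each parameter group's rate with stlr(t, T) at every update; the paper applies this recipe in stages, first fine-tuning the pretrained language model on target-task text, then fine-tuning the classifier on the labeled task data.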
