Exploiting Redundancy in Pre-trained Language Models for Efficient Transfer Learning

Large pre-trained contextual word representations have transformed natural language processing, achieving impressive results on a wide range of tasks. However, as these models grow in size, their computational requirements make them impractical for many researchers and practitioners. We hypothesize that contextual representations have both intrinsic and task-specific redundancies. We propose a novel feature selection method that exploits these redundancies to reduce the size of the pre-trained features. In a comprehensive evaluation on two pre-trained models, BERT and XLNet, using a diverse suite of sequence labeling and sequence classification tasks, our method reduces the feature set to 1--7% of its original size while retaining more than 97% of the original performance.
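The abstract does not specify the selection criterion, so the snippet below is only a minimal sketch of the general idea, assuming frozen, mean-pooled BERT sentence features scored by an off-the-shelf L1-regularized linear probe from scikit-learn. The model name, pooling scheme, toy data, and 32-dimension budget are illustrative assumptions, not the paper's actual method.

```python
# Minimal sketch (assumed, not the paper's exact algorithm): score the 768
# dimensions of frozen, mean-pooled BERT features with an L1-regularised
# linear probe and keep only a small subset for the downstream classifier.
import numpy as np
import torch
from transformers import AutoModel, AutoTokenizer
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased").eval()

# Toy sentiment data standing in for a real task such as SST-2.
sentences = [
    "a gripping, well-acted thriller",
    "dull and overlong",
    "one of the best films of the year",
    "a complete waste of two hours",
]
labels = np.array([1, 0, 1, 0])

with torch.no_grad():
    batch = tokenizer(sentences, padding=True, return_tensors="pt")
    hidden = encoder(**batch).last_hidden_state            # (batch, seq_len, 768)
    mask = batch["attention_mask"].unsqueeze(-1)
    X = ((hidden * mask).sum(1) / mask.sum(1)).numpy()     # mean-pooled features

# Keep the 32 dimensions with the largest probe weights (the budget here is
# arbitrary; the paper reports reductions to 1-7% of the original feature set).
selector = SelectFromModel(
    LogisticRegression(penalty="l1", C=1.0, solver="liblinear"),
    max_features=32,
    threshold=-np.inf,
).fit(X, labels)
X_small = selector.transform(X)
print(f"kept {X_small.shape[1]} of {X.shape[1]} feature dimensions")

# Train the actual task classifier on the reduced feature set.
clf = LogisticRegression(max_iter=1000).fit(X_small, labels)
```

Any filter or wrapper criterion could replace the L1 probe here; the point is only that a small fraction of the contextual dimensions can suffice to fit the task classifier.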
