Multi-Grained Knowledge Distillation for Named Entity Recognition

Although large pre-trained models (e.g., BERT, ERNIE, XLNet, GPT-3) have delivered top performance in sequence-to-sequence modeling, their deployment in real-world applications is often hindered by excessive computation and memory demands. For many applications, including named entity recognition (NER), matching state-of-the-art results under a constrained budget has attracted considerable attention. Drawing on recent advances in knowledge distillation (KD), this work presents a novel distillation scheme that efficiently transfers the knowledge learned by large models to their more affordable counterparts. Our solution highlights the construction of surrogate labels through the k-best Viterbi algorithm to distill knowledge from the teacher model. To maximally assimilate this knowledge into the student model, we propose a multi-grained distillation scheme, which integrates the cross-entropy loss of a conditional random field (CRF) with fuzzy learning. To validate the effectiveness of our proposal, we conduct a comprehensive evaluation on five NER benchmarks, reporting across-the-board performance gains relative to competing prior art. We further discuss ablation results to dissect the sources of these gains.
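To make the surrogate-label idea concrete, below is a minimal sketch (not the authors' code) of k-best Viterbi decoding over a teacher CRF's emission and transition scores. The function name, array shapes, and the NumPy-based setup are illustrative assumptions; the returned k highest-scoring tag sequences are the kind of surrogate label candidates the abstract refers to.

```python
import numpy as np


def k_best_viterbi(emissions: np.ndarray, transitions: np.ndarray, k: int):
    """Return the k highest-scoring tag sequences and their scores.

    emissions:   (seq_len, num_tags) per-token tag scores from the teacher.
    transitions: (num_tags, num_tags) CRF transition scores; transitions[i, j]
                 is the score of moving from tag i to tag j.
    """
    seq_len, num_tags = emissions.shape

    # beams[t][tag] holds up to k (score, backpointer) pairs for partial paths
    # ending in `tag` at position t; backpointer = (prev_tag, rank_in_prev_beam).
    beams = [[[] for _ in range(num_tags)] for _ in range(seq_len)]
    for tag in range(num_tags):
        beams[0][tag] = [(float(emissions[0, tag]), None)]

    for t in range(1, seq_len):
        for tag in range(num_tags):
            candidates = []
            for prev_tag in range(num_tags):
                for rank, (prev_score, _) in enumerate(beams[t - 1][prev_tag]):
                    score = prev_score + transitions[prev_tag, tag] + emissions[t, tag]
                    candidates.append((float(score), (prev_tag, rank)))
            # Keep only the k best partial paths ending in this tag.
            candidates.sort(key=lambda c: c[0], reverse=True)
            beams[t][tag] = candidates[:k]

    # Merge the final beams of all tags and keep the global top-k endpoints.
    finals = [(score, (tag, rank))
              for tag in range(num_tags)
              for rank, (score, _) in enumerate(beams[-1][tag])]
    finals.sort(key=lambda c: c[0], reverse=True)

    # Follow backpointers to recover each of the k best tag sequences.
    results = []
    for score, (tag, rank) in finals[:k]:
        path = [tag]
        t, ptr = seq_len - 1, beams[-1][tag][rank][1]
        while ptr is not None:
            prev_tag, prev_rank = ptr
            path.append(prev_tag)
            t -= 1
            ptr = beams[t][prev_tag][prev_rank][1]
        results.append((list(reversed(path)), score))
    return results
```

In a distillation setting of the kind the abstract describes, one plausible use (an assumption, not a claim about the authors' exact recipe) is to normalize the scores of these k sequences, e.g. with a softmax, and treat the resulting distribution over teacher-decoded label sequences as soft targets alongside the CRF cross-entropy term when training the student.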
