Multi-Grained Knowledge Distillation for Named Entity Recognition
Chenyang Tao | Wei Wang | Junya Chen | Xuan Zhou | Xiao Zhang | Bing Xu | Jing Xiao