Structural Knowledge Distillation: Tractably Distilling Information for Structured Predictor

Knowledge distillation is a critical technique for transferring knowledge between models, typically from a large model (the teacher) to a smaller one (the student). The objective function of knowledge distillation is typically the cross-entropy between the teacher's and the student's output distributions. However, for structured prediction problems, the output space is exponential in size; therefore, the cross-entropy objective becomes intractable to compute and optimize directly. In this paper, we derive a factorized form of the knowledge distillation objective for structured prediction, which is tractable for many typical choices of teacher and student models. In particular, we show the tractability and empirical effectiveness of structural knowledge distillation between sequence labeling and dependency parsing models under four scenarios: 1) the teacher and the student share the same factorization form of the output structure scoring function; 2) the student factorization produces more fine-grained substructures than the teacher factorization; 3) the teacher factorization produces more fine-grained substructures than the student factorization; 4) the factorization forms of the teacher and the student are incompatible.
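To make the factorization concrete, here is a minimal sketch of the derivation, assuming the student is a globally normalized (CRF-style) model whose structure score decomposes into a sum of substructure scores; the notation (x, y, u, s_s, Z_s, P_t) is introduced purely for illustration and is not taken from the paper. Suppose the student defines

\[ P_s(\mathbf{y} \mid \mathbf{x}) = \frac{\exp\bigl(\sum_{u \in \mathbf{y}} s_s(u, \mathbf{x})\bigr)}{Z_s(\mathbf{x})}, \]

where \(u\) ranges over the substructures of the output \(\mathbf{y}\) (e.g., label transitions or dependency arcs) and \(Z_s(\mathbf{x})\) is the student's partition function. Then the cross-entropy objective rewrites as

\[ \mathcal{L}_{\mathrm{KD}} = -\sum_{\mathbf{y}} P_t(\mathbf{y} \mid \mathbf{x}) \log P_s(\mathbf{y} \mid \mathbf{x}) = -\sum_{u} P_t(u \mid \mathbf{x})\, s_s(u, \mathbf{x}) + \log Z_s(\mathbf{x}), \]

since \(\sum_{\mathbf{y}} P_t(\mathbf{y} \mid \mathbf{x}) = 1\) and \(P_t(u \mid \mathbf{x}) = \sum_{\mathbf{y} \ni u} P_t(\mathbf{y} \mid \mathbf{x})\) is the teacher's marginal probability of substructure \(u\). The sum over exponentially many whole structures thus reduces to a sum over substructures, so the objective is tractable whenever the teacher's substructure marginals and the student's partition function can be computed efficiently (e.g., via the forward algorithm for linear-chain models or the inside algorithm for projective trees).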
