Ensemble Distillation Approaches for Grammatical Error Correction

Ensemble approaches are commonly used to improve a system by combining the predictions of multiple models. Additionally, these schemes allow the uncertainty in a prediction, as well as the source of that uncertainty, to be derived. Unfortunately, these benefits come at a computational and memory cost. To address this problem, ensemble distillation (EnD) and, more recently, ensemble distribution distillation (EnDD) have been proposed, which compress the ensemble into a single model representing either the ensemble's average prediction or its distribution of predictions, respectively. This paper examines the application of both of these distillation approaches to a sequence prediction task, grammatical error correction (GEC). This is an important application area for language learning, as it can yield highly useful feedback to the learner. It is, however, more challenging than the standard tasks investigated for distillation, as the prediction of any grammatical correction to a word is highly dependent on both the input sequence and the generated output history for that word. The performance of both EnD and EnDD is evaluated on publicly available GEC tasks as well as on a spoken language task.
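
To make the distinction concrete, below is a minimal per-token sketch of the two distillation objectives in PyTorch. This is not the paper's implementation; the function names, tensor shapes and the student_logits / student_log_alphas / ensemble_probs variables are illustrative assumptions, and practical details such as padding masks and temperature annealing are omitted.

    import torch
    import torch.nn.functional as F

    def end_loss(student_logits, ensemble_probs):
        # Ensemble distillation (EnD): train the student towards the ensemble's
        # average per-token prediction (cross-entropy to the mean, which equals
        # KL(teacher || student) up to the constant teacher entropy).
        #   student_logits: [batch, seq_len, vocab]
        #   ensemble_probs: [batch, seq_len, n_models, vocab]
        teacher = ensemble_probs.mean(dim=2)
        log_q = F.log_softmax(student_logits, dim=-1)
        return -(teacher * log_q).sum(dim=-1).mean()

    def endd_loss(student_log_alphas, ensemble_probs, eps=1e-8):
        # Ensemble distribution distillation (EnDD): the student outputs the
        # (log) concentration parameters of a Dirichlet over categorical token
        # distributions and is trained on the individual ensemble members'
        # predictions, so the diversity of the ensemble is retained.
        #   student_log_alphas: [batch, seq_len, vocab]
        #   ensemble_probs:     [batch, seq_len, n_models, vocab]
        alphas = student_log_alphas.exp()
        alpha0 = alphas.sum(dim=-1)
        # Dirichlet log-density of each member's distribution under the student.
        log_norm = torch.lgamma(alpha0) - torch.lgamma(alphas).sum(dim=-1)
        log_prob = ((alphas - 1.0).unsqueeze(2) * (ensemble_probs + eps).log()).sum(dim=-1)
        return -(log_norm.unsqueeze(2) + log_prob).mean()

At inference time the EnD student is used like any single model, whereas the EnDD student's Dirichlet output additionally allows the distilled ensemble's uncertainty, and its decomposition, to be recovered from a single forward pass.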
