A diversity-penalizing ensemble training method for deep learning

A common way to improve the performance of deep learning systems is to train an ensemble of neural networks and combine their outputs during decoding. However, this is computationally expensive at test time. In this paper, we propose a diversity-penalizing ensemble training (DPET) procedure, which trains an ensemble of differently initialized deep neural networks (DNNs) while penalizing the difference between each individual DNN's output and the ensemble's average output. In this way each model learns to emulate the average of the whole ensemble, so at test time we can use a single, arbitrarily chosen member. Experimental results on a variety of speech recognition tasks show that this technique is effective: it retains most of the WER improvement of the ensemble method while being no more expensive at test time than a single model.
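To make the objective concrete, here is a minimal sketch of a DPET-style loss in PyTorch. The penalty weight `alpha`, the choice of KL divergence as the difference measure, and the treatment of the ensemble average as a fixed target within each update are illustrative assumptions, not the paper's exact recipe.

```python
# Minimal sketch of a DPET-style training objective (illustrative only).
# Assumed: KL divergence as the penalty and a detached ensemble average;
# the paper's actual formulation and hyperparameters may differ.
import torch
import torch.nn.functional as F

def dpet_loss(logits_per_model, targets, alpha=0.5):
    """Cross-entropy plus a penalty on each model's divergence
    from the ensemble-average output distribution.

    logits_per_model: list of [batch, num_classes] tensors, one per member
    targets:          [batch] tensor of class indices
    alpha:            penalty weight (assumed hyperparameter)
    """
    # Ensemble-average posterior, detached so gradients flow only
    # through each individual member's own output.
    probs = [F.softmax(l, dim=-1) for l in logits_per_model]
    avg = torch.stack(probs).mean(dim=0).detach()

    total = 0.0
    for logits in logits_per_model:
        ce = F.cross_entropy(logits, targets)
        # KL(avg || model): penalizes the member for straying from the average.
        kl = F.kl_div(F.log_softmax(logits, dim=-1), avg, reduction="batchmean")
        total = total + ce + alpha * kl
    return total / len(logits_per_model)

if __name__ == "__main__":
    # Toy usage: four small classifiers pulled toward a shared average.
    torch.manual_seed(0)
    models = [torch.nn.Linear(10, 5) for _ in range(4)]
    x = torch.randn(8, 10)
    y = torch.randint(0, 5, (8,))
    loss = dpet_loss([m(x) for m in models], y)
    loss.backward()
```

Detaching the average means each member is pulled toward the current consensus without the consensus itself chasing any single member, which is one plausible way to drive all members toward emulating the ensemble average.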
