Annealed dropout training of deep networks

Recently it has been shown that when training neural networks on a limited amount of data, randomly zeroing, or “dropping out”, a fixed percentage of the outputs of a given layer for each training case can improve test set performance significantly. Dropout training discourages the detectors in the network from co-adapting, which limits the capacity of the network and prevents overfitting. In this paper we show that annealing the dropout rate from a high initial value to zero over the course of training can substantially improve the quality of the resulting model. Since dropout (approximately) implements model aggregation over an exponential number of networks, this procedure effectively initializes the ensemble of models learned during a given iteration of training with an ensemble that has a lower average number of neurons per network and a higher variance in the number of neurons per network, which regularizes the structure of the final model toward models that avoid unnecessary co-adaptation between neurons. Importantly, this regularization procedure is stochastic, and so promotes the learning of “balanced” networks whose neurons have high average entropy and low variance in their entropy, by smoothly transitioning from “exploration” with high dropout rates to “fine tuning” with full support for co-adaptation between neurons where necessary. Experimental results demonstrate that annealed dropout leads to significant reductions in word error rate over standard dropout training.
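For intuition only, the core idea of annealing the dropout rate from a high initial value to zero is easy to sketch. The following minimal Python example is not code from the paper; the linear schedule, its parameters (p0, total_epochs), and the inverted-dropout masking are all assumptions chosen for illustration.

    import numpy as np

    def annealed_dropout_rate(epoch, total_epochs, p0=0.5):
        # Linearly anneal the dropout probability from p0 at the start of
        # training down to zero by the final epoch (one possible schedule;
        # the paper's exact schedule may differ).
        return max(0.0, p0 * (1.0 - epoch / float(total_epochs)))

    def apply_dropout(h, p, rng):
        # Standard inverted dropout: zero each unit with probability p and
        # rescale the survivors so the expected activation is unchanged.
        if p <= 0.0:
            return h
        mask = (rng.random(h.shape) >= p).astype(h.dtype)
        return h * mask / (1.0 - p)

    rng = np.random.default_rng(0)
    h = rng.standard_normal((4, 8))          # stand-in for one layer's activations
    for epoch in range(20):
        p = annealed_dropout_rate(epoch, total_epochs=20, p0=0.5)
        h_dropped = apply_dropout(h, p, rng)  # dropout rate decays 0.5 -> 0.0

Under such a schedule, full co-adaptation between neurons is only permitted once the dropout rate reaches zero, which mirrors the exploration-to-fine-tuning transition described in the abstract.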
