Parallel training of DNNs with Natural Gradient and Parameter Averaging

We describe the neural-network training framework used in the Kaldi speech recognition toolkit, which is geared towards training DNNs with large amounts of training data on multiple GPU-equipped or multi-core machines. To remain as hardware-agnostic as possible, we needed a way to use multiple machines without generating excessive network traffic. Our method is to average the neural-network parameters periodically (typically every minute or two) and redistribute the averaged parameters to the machines for further training; each machine sees different data. By itself, this periodic averaging does not work very well. However, we pair it with a second method, an approximate and efficient implementation of Natural Gradient for Stochastic Gradient Descent (NG-SGD), which seems to allow the periodic-averaging scheme to work well, and which also substantially improves the convergence of SGD on a single machine.
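
To make the parallelization scheme concrete, the sketch below simulates the outer loop in Python under stated assumptions. It is not Kaldi's implementation (which runs in C++ across separate machines), and it substitutes plain SGD for NG-SGD inside each worker. The names `sgd_steps` and `sync_every`, the shard layout, and the toy least-squares objective are hypothetical choices for illustration; only the structure follows the description above: independent training on different data, periodic averaging, and redistribution of the averaged parameters.

```python
"""Toy simulation of periodic parameter averaging (illustrative sketch only).

Each "worker" runs plain SGD on its own shard of a least-squares problem;
every `sync_every` steps, the workers' parameter vectors are averaged and
the average is redistributed as the new starting point. In the real
framework, the workers are separate GPU/multi-core machines and each
applies NG-SGD preconditioning, which is omitted here.
"""
import numpy as np

rng = np.random.default_rng(0)

# Synthetic least-squares problem: minimize ||X w - y||^2.
dim, n_per_worker, n_workers = 10, 2000, 4
w_true = rng.normal(size=dim)
shards = []
for _ in range(n_workers):
    X = rng.normal(size=(n_per_worker, dim))
    y = X @ w_true + 0.1 * rng.normal(size=n_per_worker)
    shards.append((X, y))  # each worker sees different data


def sgd_steps(w, X, y, steps, lr=0.01, batch=32):
    """Run `steps` minibatch SGD updates starting from w; return the new w."""
    w = w.copy()
    for _ in range(steps):
        idx = rng.integers(0, len(y), size=batch)
        grad = 2.0 * X[idx].T @ (X[idx] @ w - y[idx]) / batch
        w -= lr * grad
    return w


w = np.zeros(dim)               # shared initial parameters
sync_every, n_rounds = 50, 20   # "every minute or two" becomes a step count here
for r in range(n_rounds):
    # Each worker trains independently from the same shared snapshot...
    worker_params = [sgd_steps(w, X, y, sync_every) for X, y in shards]
    # ...then the parameters are averaged and redistributed.
    w = np.mean(worker_params, axis=0)
    loss = np.mean([np.mean((X @ w - y) ** 2) for X, y in shards])
    print(f"round {r:2d}  mean loss {loss:.4f}")
```

Averaging only every minute or two, rather than after every minibatch, is what keeps inter-machine traffic low; the trade-off is that each worker trains from a shared snapshot that is refreshed only at each synchronization.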
