Parallel deep neural network training for LVCSR tasks using Blue Gene/Q

While Deep Neural Networks (DNNs) have achieved tremendous success on LVCSR tasks, training these networks is slow. To date, the most common approach to training DNNs is stochastic gradient descent (SGD), run serially on a single GPU machine. Serial training, coupled with the large number of model parameters and the size of speech training sets, makes DNN training very slow for LVCSR tasks. Second-order, data-parallel methods have also been explored, but they are not always faster on CPU clusters because of the large communication cost between processors. In this work, we explore a specialized hardware/software approach based on a Blue Gene/Q (BG/Q) system, which provides thousands of processors and excellent interprocessor communication. We use the second-order Hessian-free (HF) algorithm on BG/Q for both cross-entropy and sequence training of DNNs. Results on three LVCSR tasks indicate that HF on BG/Q offers up to an 11x speedup, as well as an improved word error rate (WER), compared to SGD on a GPU.
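
To make the approach concrete, below is a minimal, self-contained sketch of one data-parallel Hessian-free update in the spirit of the paper. It is an illustration under stated assumptions, not the paper's implementation: a linear least-squares "model" stands in for a DNN so the Gauss-Newton matrix-vector product is exact, summing over data shards stands in for the MPI all-reduce a BG/Q system would perform across compute nodes, and the shard count, conjugate-gradient iteration limit, and helper `sharded_sum` are all hypothetical choices made here for the example.

```python
# Sketch of one data-parallel Hessian-free (HF) update.
# Assumptions (not from the paper): a linear least-squares model replaces
# the DNN, and summing per-shard quantities replaces the MPI all-reduce
# that a real BG/Q implementation would perform across processors.
import numpy as np

rng = np.random.default_rng(0)
n, d, n_shards = 1024, 16, 8           # examples, parameters, fake "workers"
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)
w = np.zeros(d)                        # current parameters

# Each shard plays the role of the data slice owned by one processor.
shards = np.array_split(np.arange(n), n_shards)

def sharded_sum(fn):
    """Average a per-shard quantity over all shards.
    On BG/Q this sum would be an MPI all-reduce across nodes."""
    return sum(fn(X[idx], y[idx]) for idx in shards) / n

# Gradient of the mean squared error, accumulated shard by shard.
grad = sharded_sum(lambda Xs, ys: Xs.T @ (Xs @ w - ys))

def gauss_newton_vec(v):
    # Gauss-Newton product G v; for squared loss this is (X^T X / n) v,
    # again accumulated over shards as a stand-in for an all-reduce.
    return sharded_sum(lambda Xs, ys: Xs.T @ (Xs @ v))

# Inner conjugate-gradient loop of HF: approximately solve G d = -grad,
# using only matrix-vector products (never forming G explicitly).
d_dir = np.zeros(d)
r = -grad - gauss_newton_vec(d_dir)
p = r.copy()
for _ in range(20):
    Gp = gauss_newton_vec(p)
    alpha = (r @ r) / (p @ Gp)
    d_dir += alpha * p
    r_new = r - alpha * Gp
    if np.linalg.norm(r_new) < 1e-8:
        break
    p = r_new + ((r_new @ r_new) / (r @ r)) * p
    r = r_new

w += d_dir                             # HF parameter update
print("gradient norm after update:",
      np.linalg.norm(sharded_sum(lambda Xs, ys: Xs.T @ (Xs @ w - ys))))
```

The design point the sketch captures is why HF suits a machine like BG/Q: each CG iteration needs only per-shard gradient and curvature-vector products followed by a global reduction, so with thousands of processors and fast interconnects the communication cost that hampers CPU clusters largely disappears.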
