Parallel Deep Neural Network Training for Big Data on Blue Gene/Q

Deep Neural Networks (DNNs) have recently been shown to significantly outperform existing machine learning techniques in several pattern recognition tasks. DNNs are the state-of-the-art models used in image recognition, object detection, classification and tracking, and speech and language processing applications. The biggest drawback to DNNs has been the enormous cost in computation and time required to train the parameters of the networks, often a tenfold increase relative to conventional techniques. Such training time costs can be mitigated by the application of parallel computing algorithms and architectures. However, these algorithms are often limited by inter-processor communication bottlenecks. In this paper, we describe how to enable parallel Deep Neural Network training on the IBM Blue Gene/Q (BG/Q) computer system. Specifically, we explore DNN training using a data-parallel, Hessian-free second-order optimization algorithm. Such an algorithm is particularly well-suited to parallelization across a large set of loosely coupled processors. BG/Q, with its excellent inter-processor communication characteristics, is an ideal match for this type of algorithm. The paper discusses how issues regarding the programming model and data-dependent load imbalances are addressed. Results on large-scale speech tasks show that performance on BG/Q scales linearly up to 4,096 processes with no loss in accuracy. This allows us to train neural networks using billions of training examples in a few hours.
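
To make the structure of data-parallel Hessian-free optimization concrete, the sketch below shows its three core pieces: per-shard gradient sums, implicit Gauss-Newton curvature-vector products, and a conjugate-gradient solve of the damped system. This is a minimal illustration, not the paper's implementation: a linear least-squares model stands in for the DNN loss, Python lists of data shards stand in for BG/Q worker processes, and the damping value and CG iteration count are arbitrary assumptions. In the distributed setting, the per-shard sums would instead be combined across processes, e.g. with an MPI allreduce.

```python
# Minimal sketch of data-parallel Hessian-free optimization.
# Assumptions: a linear least-squares model replaces the DNN, shards in a
# Python list simulate worker processes, and damping/CG settings are arbitrary.
import numpy as np

rng = np.random.default_rng(0)

def make_shards(n_shards=4, n_per_shard=256, dim=32):
    """Generate synthetic (X, y) shards, one per simulated worker."""
    w_true = rng.normal(size=dim)
    shards = []
    for _ in range(n_shards):
        X = rng.normal(size=(n_per_shard, dim))
        y = X @ w_true + 0.01 * rng.normal(size=n_per_shard)
        shards.append((X, y))
    return shards

def gradient(w, shards):
    """Sum per-shard gradients of 0.5*||Xw - y||^2 (an allreduce in the real system)."""
    return sum(X.T @ (X @ w - y) for X, y in shards)

def gauss_newton_vec(v, shards):
    """Curvature-vector product G v, computed implicitly shard by shard."""
    return sum(X.T @ (X @ v) for X, _ in shards)

def conjugate_gradient(matvec, b, iters=50, tol=1e-8):
    """Solve matvec(x) = b with plain CG; the matrix is only applied to vectors."""
    x = np.zeros_like(b)
    r = b - matvec(x)
    p = r.copy()
    rs = r @ r
    for _ in range(iters):
        Ap = matvec(p)
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

def hessian_free_step(w, shards, damping=1e-3):
    """One HF outer iteration: CG on the damped Gauss-Newton system."""
    g = gradient(w, shards)
    matvec = lambda v: gauss_newton_vec(v, shards) + damping * v
    return w + conjugate_gradient(matvec, -g)

shards = make_shards()
w = np.zeros(32)
for it in range(5):
    w = hessian_free_step(w, shards)
    loss = 0.5 * sum(np.sum((X @ w - y) ** 2) for X, y in shards)
    print(f"outer iteration {it}: loss = {loss:.4f}")
```

The data-parallel character of the method is visible in `gradient` and `gauss_newton_vec`: each worker touches only its own shard, and the only cross-worker communication is the summation of small parameter-sized vectors, which is why the approach maps well to a large set of loosely coupled processors with fast collective communication.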
