Two Tiered Distributed Training Algorithm for Acoustic Modeling
Pranav Ladkat, Oleg Rybakov, Radhika Arava, Sree Hari Krishnan Parthasarathi, I-Fan Chen, Nikko Strom
[1] Sree Hari Krishnan Parthasarathi, et al. Lessons from Building Acoustic Models with a Million Hours of Speech, 2019, ICASSP.
[2] Andrew Zisserman, et al. Very Deep Convolutional Networks for Large-Scale Image Recognition, 2014, ICLR.
[3] Lei Xie, et al. Empirical Evaluation of Parallel Training Algorithms on Acoustic Modeling, 2017, INTERSPEECH.
[4] Qiang Huo, et al. Scalable training of deep learning machines by incremental block training with intra-block parallel optimization and blockwise model-update filtering, 2016, ICASSP.
[5] Dan Alistarh, et al. QSGD: Randomized Quantization for Communication-Optimal Stochastic Gradient Descent, 2016, arXiv.
[6] Sree Hari Krishnan Parthasarathi, et al. fMLLR based feature-space speaker adaptation of DNN acoustic models, 2015, INTERSPEECH.
[7] Xin Yuan, et al. Bandwidth optimal all-reduce algorithms for clusters of workstations, 2009, J. Parallel Distributed Comput.
[8] Pranav Ladkat, et al. Realizing Petabyte Scale Acoustic Modeling, 2019, IEEE Journal on Emerging and Selected Topics in Circuits and Systems.
[9] Zheng Xu, et al. Training Neural Networks Without Gradients: A Scalable ADMM Approach, 2016, ICML.
[10] Xiaohui Zhang, et al. Parallel training of Deep Neural Networks with Natural Gradient and Parameter Averaging, 2014, ICLR.
[11] Trishul M. Chilimbi, et al. Project Adam: Building an Efficient and Scalable Deep Learning Training System, 2014, OSDI.
[12] Stephen P. Boyd, et al. Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers, 2011, Found. Trends Mach. Learn.
[13] Sree Hari Krishnan Parthasarathi, et al. Robust Speech Recognition via Anchor Word Representations, 2017, INTERSPEECH.
[14] Marc'Aurelio Ranzato, et al. Large Scale Distributed Deep Networks, 2012, NIPS.
[15] Kaiming He, et al. Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour, 2017, arXiv.
[16] William J. Dally, et al. Deep Gradient Compression: Reducing the Communication Bandwidth for Distributed Training, 2017, ICLR.
[17] Yuanzhou Yang, et al. Highly Scalable Deep Learning Training System with Mixed-Precision: Training ImageNet in Four Minutes, 2018, arXiv.
[18] Alexander J. Smola, et al. Parallelized Stochastic Gradient Descent, 2010, NIPS.
[19] Tara N. Sainath, et al. Parallel Deep Neural Network Training for Big Data on Blue Gene/Q, 2017, IEEE Transactions on Parallel and Distributed Systems.
[20] Nikko Strom, et al. Scalable distributed DNN training using commodity GPU cloud computing, 2015, INTERSPEECH.
[21] Stephen J. Wright, et al. Hogwild: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent, 2011, NIPS.
[22] Geoffrey Zweig, et al. The Microsoft 2016 Conversational Speech Recognition System, 2017, ICASSP.
[23] Sree Hari Krishnan Parthasarathi, et al. Robust i-vector based adaptation of DNN acoustic model for speech recognition, 2015, INTERSPEECH.
[24] Dong Yu, et al. 1-bit stochastic gradient descent and its application to data-parallel distributed training of speech DNNs, 2014, INTERSPEECH.
[25] Message Passing Interface Forum. MPI: A message-passing interface standard, 1994.
[26] Torsten Hoefler, et al. Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis, 2018.
[27] Yang You, et al. Scaling SGD Batch Size to 32K for ImageNet Training, 2017, arXiv.
[28] Tara N. Sainath, et al. Deep Neural Networks for Acoustic Modeling in Speech Recognition, 2012.
[29] James Martens, et al. Deep learning via Hessian-free optimization, 2010, ICML.
[30] Geoffrey E. Hinton, et al. ImageNet classification with deep convolutional neural networks, 2012, Commun. ACM.