Two-stage ASGD framework for parallel training of DNN acoustic models using Ethernet

Deep neural networks (DNNs) have brought significant improvements to acoustic modelling, pushing state-of-the-art performance in large vocabulary continuous speech recognition (LVCSR) tasks. However, training DNNs on large-scale data is very time-consuming. In this paper, a data-parallel method, namely two-stage ASGD, is proposed. Two-stage ASGD is based on the asynchronous stochastic gradient descent (ASGD) paradigm and is tuned for a GPU-equipped computing cluster connected by 10 Gbit/s Ethernet rather than InfiniBand. Several techniques, such as hierarchical learning rate control, double-buffering and order-locking, are applied to optimise the communication-to-transmission ratio. The proposed framework is evaluated by training a DNN with 29.5M parameters on a 500-hour Chinese continuous telephone speech data set. Using 4 computing nodes and 8 GPU devices (2 devices per node), a 5.9-fold speed-up over a single GPU is obtained with an acceptable loss of accuracy (0.5% on average). A comparative experiment is also carried out against the parallel DNN training systems reported in prior work.
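
The Python sketch below is only an illustration of the mechanics named in the abstract (two-stage aggregation, double-buffering, order-locking, and two-level learning rates), not the authors' implementation. All names (ParameterServer, NodeCommunicator, Node), the learning-rate values, and the thread-based "transfer" are assumptions standing in for real GPU workers and 10 Gbit/s Ethernet communication.

```python
# Minimal two-stage ASGD sketch (assumed structure, not the paper's code):
# stage 1 - GPUs on one node accumulate gradients into a local buffer;
# stage 2 - a per-node communicator thread ships the filled buffer to a
# global parameter server asynchronously while the node keeps computing.
import queue
import threading

import numpy as np

DIM = 1000          # toy model size (the real DNN has 29.5M parameters)
LOCAL_LR = 0.1      # node-level learning rate (one level of the assumed hierarchy)
GLOBAL_LR = 0.5     # server-level learning rate (second level; value is illustrative)


class ParameterServer:
    """Global model shared by all nodes; updates arrive asynchronously."""

    def __init__(self, dim):
        self.weights = np.zeros(dim)
        self.lock = threading.Lock()  # serialises updates so they apply in a fixed order

    def push_and_pull(self, grad):
        """Apply one node-level gradient and return the latest weights."""
        with self.lock:
            self.weights -= GLOBAL_LR * grad
            return self.weights.copy()


class NodeCommunicator(threading.Thread):
    """Per-node thread that ships a filled buffer to the server (stage 2),
    overlapping the transfer with the node's ongoing computation."""

    def __init__(self, server, node):
        super().__init__(daemon=True)
        self.server = server
        self.node = node

    def run(self):
        while True:
            grad = self.node.ready.get()          # blocks until a buffer is handed over
            self.node.model = self.server.push_and_pull(grad)


class Node:
    """One computing node: accumulates its local GPUs' gradients (stage 1)
    into the active buffer while the other buffer is in flight."""

    def __init__(self, server, dim):
        self.model = np.zeros(dim)
        self.buffers = [np.zeros(dim), np.zeros(dim)]
        self.active = 0
        self.ready = queue.Queue(maxsize=1)       # at most one transfer in flight
        NodeCommunicator(server, self).start()

    def accumulate(self, worker_grad):
        self.buffers[self.active] += LOCAL_LR * worker_grad

    def flip(self):
        """Double-buffering: hand the filled buffer to the communicator and
        keep accumulating into the other one."""
        full = self.active
        self.active = 1 - self.active
        self.ready.put(self.buffers[full].copy())  # blocks only if the last transfer lags
        self.buffers[full][:] = 0.0


if __name__ == "__main__":
    server = ParameterServer(DIM)
    nodes = [Node(server, DIM) for _ in range(4)]      # 4 nodes, as in the paper's setup
    for step in range(20):
        for node in nodes:
            for gpu in range(2):                       # 2 GPUs per node
                node.accumulate(np.random.randn(DIM))  # stand-in for a mini-batch gradient
            node.flip()
    print("server weight norm after 20 steps:", np.linalg.norm(server.weights))
```

The size-one queue is what makes the double-buffering useful in this sketch: a node stalls only if it gets a full buffer ahead of the network, so gradient transfer and gradient computation otherwise overlap, which is the property the abstract targets for an Ethernet-connected cluster.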
