Improving deep neural networks for LVCSR using dropout and shrinking structure

Recently, hybrid deep neural network / hidden Markov model (DNN/HMM) systems have achieved dramatic gains over the conventional GMM/HMM approach on various large vocabulary continuous speech recognition (LVCSR) tasks. In this paper, we propose two new methods to further improve the hybrid DNN/HMM model: i) using dropout as a pre-conditioner (DAP) to initialize the DNN prior to back-propagation (BP), for better recognition accuracy; ii) employing a shrinking DNN structure (sDNN), with hidden layers decreasing in size from bottom to top, to reduce model size and computation time. The proposed DAP method is evaluated on a 70-hour Mandarin transcription (PSC) task and the 309-hour Switchboard (SWB) task. Compared with the traditional greedy layer-wise pre-trained DNN, it achieves about 10% and 6.8% relative recognition error reduction on the PSC and SWB tasks, respectively. In addition, we evaluate sDNN, as well as its combination with DAP, on the SWB task. Experimental results show that these methods can reduce the model to 45% of its original size and accelerate training and test time by 55%, without losing recognition accuracy.
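
To make the model-size claim concrete, the parameter savings of a shrinking structure can be sketched by counting the weights of tapering versus constant-width hidden layers. The layer sizes below are hypothetical, chosen only for illustration (the abstract does not give the paper's exact configuration); the input and output dimensions are likewise assumed values typical of LVCSR acoustic models.

```python
def dnn_params(input_dim, hidden_sizes, output_dim):
    """Count weights plus biases of a fully connected feed-forward DNN."""
    sizes = [input_dim] + hidden_sizes + [output_dim]
    return sum(sizes[i] * sizes[i + 1] + sizes[i + 1]
               for i in range(len(sizes) - 1))

# Assumed dimensions: e.g. 11 stacked frames of 39-dim features,
# and several thousand tied triphone state targets.
INPUT_DIM, OUTPUT_DIM = 429, 9000

# Conventional DNN: six equally wide 2048-unit hidden layers.
baseline = dnn_params(INPUT_DIM, [2048] * 6, OUTPUT_DIM)

# Hypothetical sDNN: hidden layers shrink from bottom to top,
# so the large upper weight matrices (including the output layer's
# input side) become much smaller.
shrinking = dnn_params(INPUT_DIM, [2048, 1792, 1536, 1280, 1024, 768],
                       OUTPUT_DIM)

print(f"baseline:  {baseline:,} parameters")
print(f"shrinking: {shrinking:,} parameters")
print(f"sDNN size: {shrinking / baseline:.0%} of baseline")
```

With these illustrative sizes the tapered network ends up at roughly 45% of the baseline's parameter count, in line with the reduction reported above; the saving comes mostly from shrinking the square upper hidden-layer matrices and the input side of the large softmax output layer.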
