Investigation on dimensionality reduction of concatenated features with deep neural network for LVCSR systems

The hybrid model, context-dependent deep neural network hidden Markov models (CD-DNN-HMMs), has achieved significant improvements on various challenging large vocabulary continuous speech recognition (LVCSR) tasks in recent years. It has further been reported that the gains of the DNN are almost entirely attributable to using features concatenated from consecutive speech frames as the DNN's inputs. This result indicates that DNNs are excellent at exploiting high-dimensional features, whereas for GMMs we must resort to dimensionality reduction techniques to avoid the "curse of dimensionality". In this paper, we attempt to derive compact and informative low-dimensional representations from concatenated features for GMMs. PCA, the simplest option, is considered first, but it does not work well in this setting. We then focus on investigating DNN-based bottleneck features. Experiments on a Mandarin LVCSR task and the Switchboard task both show that the recognition performance of GMM-HMMs trained with bottleneck features (BN-GMM-HMMs) is comparable to that of CD-DNN-HMMs. Moreover, when discriminative training is leveraged, BN-GMM-HMMs surprisingly provide a nearly 8% relative error reduction over CD-DNN-HMMs on the Mandarin LVCSR task.
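The PCA baseline the abstract refers to can be sketched as follows: splice each frame with its neighbors, then project the high-dimensional spliced vectors onto the top principal components. This is a minimal illustration only; the context width, feature dimension, and function names are assumptions for the example, not details taken from the paper.

```python
import numpy as np

def splice_frames(feats, context=5):
    """Concatenate each frame with +/-context neighbors (edges padded by repetition)."""
    T, D = feats.shape
    padded = np.pad(feats, ((context, context), (0, 0)), mode="edge")
    return np.stack([padded[t:t + 2 * context + 1].reshape(-1) for t in range(T)])

def pca_reduce(X, out_dim=39):
    """Project zero-meaned features onto the top principal components via SVD."""
    Xc = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, sorted by singular value.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:out_dim].T

# Toy example: 100 frames of 13-dim MFCC-like features.
rng = np.random.default_rng(0)
frames = rng.standard_normal((100, 13))
spliced = splice_frames(frames, context=5)   # shape (100, 143)
reduced = pca_reduce(spliced, out_dim=39)    # shape (100, 39)
print(spliced.shape, reduced.shape)
```

A bottleneck DNN replaces the linear PCA projection with the activations of a narrow hidden layer, which is what the paper argues yields features better matched to GMM modeling.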
