Large vocabulary continuous speech recognition with context-dependent DBN-HMMs

The context-independent deep belief network (DBN) hidden Markov model (HMM) hybrid architecture has recently achieved promising results for phone recognition. In this work, we propose a context-dependent DBN-HMM system that dramatically outperforms strong Gaussian mixture model (GMM)-HMM baselines on a challenging, large-vocabulary, spontaneous speech recognition dataset from the Bing mobile voice search task. Our system achieves absolute sentence accuracy improvements of 5.8% and 9.2% over GMM-HMMs trained using the minimum phone error rate (MPE) and maximum likelihood (ML) criteria, respectively, which translate to relative error reductions of 16.0% and 23.2%.
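The link between the absolute and relative figures can be checked with a little arithmetic. The sketch below assumes the usual definition, in which relative error reduction is the absolute accuracy gain divided by the baseline sentence error rate. The abstract does not state the baseline error rates, so the values computed here are only what the rounded percentages imply, and the function names are illustrative.

```python
def relative_error_reduction(baseline_acc, new_acc):
    """Relative error reduction (%) given two sentence accuracies (%)."""
    baseline_err = 100.0 - baseline_acc
    return (new_acc - baseline_acc) / baseline_err * 100.0


def implied_baseline_error(absolute_gain, relative_reduction):
    """Baseline sentence error rate (%) implied by an absolute gain (%)
    and a relative error reduction (%). Both inputs are rounded figures
    from the abstract, so the result is approximate."""
    return absolute_gain / relative_reduction * 100.0


# Figures quoted in the abstract: (absolute gain, relative reduction).
mpe_err = implied_baseline_error(5.8, 16.0)   # GMM-HMM trained with MPE
ml_err = implied_baseline_error(9.2, 23.2)    # GMM-HMM trained with ML

print(f"Implied MPE baseline sentence error: {mpe_err:.1f}%")
print(f"Implied ML baseline sentence error:  {ml_err:.1f}%")

# Round trip: feeding the implied baseline back in recovers the quoted
# relative reduction.
mpe_acc = 100.0 - mpe_err
print(f"{relative_error_reduction(mpe_acc, mpe_acc + 5.8):.1f}%")
```

Under these assumptions, both baselines sit at sentence error rates in the high 30s, which is why gains of under ten points in absolute accuracy correspond to relative error reductions of 16% and more.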
