Paper information - Comparison of syllable-based and phoneme-based DNN-HMM in Japanese speech recognition

Japanese is a syllabic language, and we have previously studied syllable-based GMM-HMM for Japanese speech recognition. In this paper, we investigate the differences in recognition accuracy between phoneme-based and syllable-based GMM-HMM and DNN (Deep Neural Network)-HMM. First, we present a comparison of syllable-based and phoneme-based DNN-HMM. Second, we train a tied-state left-context-dependent syllable DNN-HMM and compare these three types of modeling methods. In the experiments, we obtained a 5% relative gain in word error rate (WER) using the left-context syllable DNN-HMM in comparison with a left-context syllable GMM-HMM, and an 11% relative gain in WER using the triphone DNN-HMM in comparison with a syllable-based DNN-HMM. Finally, we found that modeling the left-context phoneme did not work and that the context-independent syllable-based DNN-HMM achieved the best performance in the experiments on the ASJ+JNAS corpus, which consists of about 70 hours of speech.
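The "relative gain" figures above are most naturally read as relative WER reductions, that is, the drop in WER divided by the baseline WER. That reading is an assumption, since the abstract does not define the term. A minimal Python sketch of the arithmetic follows; the function name and the WER values are hypothetical and chosen only for illustration, not taken from the paper.

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Return the relative WER reduction as a fraction of the baseline WER."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer


# Hypothetical WERs in percent (illustrative only, not results from the paper).
baseline = 20.0  # e.g. a GMM-HMM system
improved = 19.0  # e.g. a DNN-HMM system

print(f"{relative_wer_reduction(baseline, improved):.1%}")  # prints 5.0%
```

Under this definition, a 1-point absolute drop from a 20% baseline corresponds to a 5% relative gain, which is why relative and absolute improvements should not be compared directly.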
