Comparison of syllable-based and phoneme-based DNN-HMM in Japanese speech recognition

Japanese is a syllabic language, and we have previously studied syllable-based GMM-HMMs for Japanese speech recognition. In this paper, we investigate the differences in recognition accuracy between phoneme-based and syllable-based GMM-HMMs and DNN (Deep Neural Network)-HMMs. First, we compare syllable-based and phoneme-based DNN-HMMs. Second, we train a tied-state left-context-dependent syllable DNN-HMM and compare these three types of modeling method. In the experiments, we obtained a 5% relative gain in word error rate (WER) using a left-context syllable DNN-HMM in comparison with a left-context syllable GMM-HMM, and an 11% relative gain in WER using a triphone DNN-HMM in comparison with a syllable-based DNN-HMM. Finally, we found that modeling the left-context phoneme did not work, and that the context-independent syllable-based DNN-HMM achieved the best performance in the experiments on the ASJ+JNAS corpus, which consists of about 70 hours of speech.
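
The "relative gain" figures above are relative WER reductions, that is, the drop in WER divided by the baseline WER. The abstract does not give the underlying absolute WERs, so the following is only a minimal sketch of the arithmetic. The function name and the WER values of 20.0 and 19.0 are hypothetical and chosen purely for illustration.

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Return the relative WER reduction of new_wer over baseline_wer as a fraction."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer


# Hypothetical absolute WERs in percent; NOT taken from the paper.
baseline = 20.0  # e.g. a GMM-HMM system
improved = 19.0  # e.g. a DNN-HMM system

gain = relative_wer_reduction(baseline, improved)
print(f"relative WER reduction: {gain:.1%}")
```

Under these assumed values a one-point absolute drop corresponds to a 5% relative gain, which shows why relative and absolute improvements should not be confused when reading the reported numbers.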
