Sub-phonetic polynomial segment model for large vocabulary continuous speech recognition

The polynomial segment model (PSM) has opened up an alternative research direction for acoustic modeling. In our previous papers, we proposed efficient incremental likelihood evaluation and EM training algorithms for PSM that significantly improve the speed of PSM training and recognition. In this paper, we shift our focus to applying PSM to large vocabulary recognition. Recognition via N-best re-scoring shows that PSM outperforms HMM on the 5k closed-vocabulary Wall Street Journal Nov 92 test set. Our best PSM achieved 7.15% WER, compared with 7.81% for a 16-mixture HMM. Specifically, we used a sub-phonetic PSM that represents a phoneme as multiple independent segmental units, which allows for more effective model sharing. We also derived and compared different top-down mixture-growing approaches that are orders of magnitude more efficient than previously proposed bottom-up agglomerative clustering techniques. Experimental results show that top-down clustering performs better than the bottom-up approaches.
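The abstract does not spell out the segment likelihood, so the following is only a minimal sketch under stated assumptions. It follows the usual parametric trajectory formulation, in which each frame of a segment is Gaussian around a polynomial mean evaluated at a time index normalized to [0, 1]. The function names, the diagonal covariance, and the single-Gaussian (non-mixture) form are illustrative assumptions, not the authors' implementation, and the incremental evaluation mentioned above is not reproduced here.

```python
import math


def normalized_times(num_frames):
    """Map frame indices 0..N-1 onto [0, 1] so that segments of
    different lengths share one polynomial trajectory."""
    if num_frames == 1:
        return [0.0]
    return [t / (num_frames - 1) for t in range(num_frames)]


def trajectory_mean(coeffs, tau):
    """Evaluate the polynomial mean at normalized time tau.
    coeffs[r] multiplies tau**r (hypothetical parameter layout)."""
    return sum(b * tau ** r for r, b in enumerate(coeffs))


def segment_log_likelihood(frames, coeffs, variances):
    """Log-likelihood of one segment under a single polynomial
    trajectory with diagonal residual covariance (an assumption).

    frames:    list of N feature vectors, each of dimension D
    coeffs:    coeffs[d] holds the polynomial coefficients for dimension d
    variances: variances[d] is the residual variance for dimension d
    """
    total = 0.0
    for tau, frame in zip(normalized_times(len(frames)), frames):
        for d, y in enumerate(frame):
            mu = trajectory_mean(coeffs[d], tau)
            var = variances[d]
            # Per-frame, per-dimension Gaussian log-density.
            total += -0.5 * (math.log(2 * math.pi * var) + (y - mu) ** 2 / var)
    return total


# Usage: a 3-frame, 1-dimensional segment scored against a linear trajectory.
frames = [[0.0], [0.5], [1.0]]
on_trajectory = segment_log_likelihood(frames, [[0.0, 1.0]], [1.0])
off_trajectory = segment_log_likelihood(frames, [[1.0, 0.0]], [1.0])
```

In this toy case the frames lie exactly on the linear trajectory, so `on_trajectory` is higher than `off_trajectory`, which scores the same frames against a constant mean. A sub-phonetic system in the sense described above would apply such a score to each of several segmental units per phoneme rather than to the whole phoneme.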
