Development of a Large Vocabulary Continuous Speech Recognition System for Rich Transcription Evaluation Using HTK

The focus of this research is to build a large vocabulary continuous speech recognition (LVCSR) system that converts speech to text in accordance with the requirements of the National Institute of Standards and Technology (NIST) Rich Transcription Evaluation (RTE). The result of the current effort will serve as a baseline for future work on an advanced speech recognition system based on WUW technology. Cambridge University's HTK Speech Recognition Toolkit, version 3.4, serves as the recognition engine. To create a sufficiently large speech dataset, multiple corpora are combined, including TIMIT and the NIST RTE 2006 (RT06) and 2007 (RT07) data. Recognition testing and evaluation are performed under a variety of conditions to find the parameters that give the best accuracy. Modifiable factors include insertion penalties (IP), language models, phonetic questioning, bootstrapping, and skip states. Performance is measured by word error rate (WER). Adding insertion penalties consistently improved WER at both the phone level and the word level, while removing the TIMIT shibboleth sentences produced no significant change in WER. Phonetic questioning reduced computation time without a significant increase in WER. Training and testing on TIMIT corpus data with a language model attained the lowest WER, 78.74%. Although a WER of 78.74% is higher than what other research has achieved with HTK, adding the language model reduced WER by 48.97% relative to the same system without it. Additionally, performance is better than expected for a system that uses only one Gaussian mixture and three output states per HMM.
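As an illustration of the metric used above, the following is a minimal sketch of how WER and a relative WER reduction can be computed. It assumes whitespace-tokenized transcripts and unit costs for substitutions, deletions, and insertions; the function names `word_error_rate` and `relative_wer_reduction` are hypothetical and are not part of HTK or of the system described here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count.

    Assumption: transcripts are plain strings split on whitespace, and
    every edit operation costs 1 (standard Levenshtein alignment).
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # substitution or exact match
            )
        prev = curr
    return prev[-1] / len(ref)


def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Fraction by which new_wer improves on baseline_wer."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (baseline_wer - new_wer) / baseline_wer


if __name__ == "__main__":
    # One substitution ("cat" -> "bat") and one insertion ("down")
    # against a three-word reference.
    print(word_error_rate("the cat sat", "the bat sat down"))
    print(relative_wer_reduction(0.50, 0.25))
```

Because insertions count as errors but the denominator is the reference length, WER can exceed 100%. This is also why an insertion penalty can change WER: it shifts the balance between inserted and deleted words in the recognizer output.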
