Advances in Acoustic Modeling for Vietnamese LVCSR

In this paper, we present experiments on the selection of basic phonetic units for Vietnamese large-vocabulary continuous speech recognition (LVCSR). Two acoustic models are compared. The first model uses only simple vowels (monophthongs) as phonemes [2], while the second, proposed in this paper, also uses diphthongs and triphthongs as phonemes. Both models were trained and evaluated on a Broadcast News corpus containing 27 hours of acoustic training data and 1 hour of acoustic test data. In addition, a 146M-word collection of newspaper text was used to build the language models. Experimental results indicate significant improvements in both word accuracy and execution time. With the second acoustic model, word accuracy reaches 86.06% in the best case, and execution is faster than real time.

[1] Andreas Stolcke, et al. SRILM - an extensible language modeling toolkit, 2002, INTERSPEECH.

[2] Vu Hai Quan, et al. A robust method for the Vietnamese handwritten and speech recognition, 2002, Object recognition supported by user interaction for service robots.

[3] Duc Duong, et al. An empirical study of multipass decoding for Vietnamese LVCSR, 2008, SLTU.

[4] Li Deng, et al. Modeling context-dependent phonetic units in a continuous speech recognition system for Mandarin Chinese, 1996, Proceeding of Fourth International Conference on Spoken Language Processing. ICSLP '96.

[5] Tanja Schultz, et al. Thai automatic speech recognition, 2005, Proceedings. (ICASSP '05). IEEE International Conference on Acoustics, Speech, and Signal Processing, 2005.

[6] Dirk Van Compernolle, et al. Vietnamese Automatic Speech Recognition: The FLaVoR Approach, 2006, ISCSLP.