First steps in building a large vocabulary continuous speech recognition system for Vietnamese

This paper presents an overview of our work on building a Large Vocabulary Continuous Speech Recognition (LVCSR) system for Vietnamese, carried out at the CLIPS-IMAG Laboratory (France) and the International Research Center MICA (Vietnam). First, we propose a new methodology for fast acquisition of text corpora for minority languages and apply it to Vietnamese. Second, we present the first results of building a large speech corpus for Vietnamese (VNSpeechCorpus) and a phonetic dictionary, which is used for the automatic alignment process. Finally, we construct a language model and an acoustic model to obtain an LVCSR system for Vietnamese.

Index Terms: Automatic Speech Recognition, LVCSR, text corpus, speech corpus, pronunciation dictionary, language modeling, acoustic modeling.