Thai Speech Recognition Corpora

Speech recognition technology is advancing rapidly, and many techniques have been proposed. However, even the best algorithm in a carefully designed system cannot achieve good recognition performance if it is trained on an unreliable corpus. Speech corpus development is therefore a crucial research area. This paper describes two speech corpora developed for Thai speech recognition: the ORCHID-SPEECH CORPUS and the NECTEC-ATR Thai speech corpus. It also explains how each corpus was built to preserve three important properties: consistency, balance, and coverage of the possible phoneme combinations. The corpus design, the details of each corpus set, and the problems encountered with them are presented as well.