Paper information - Phonetically Distributed Continuous Speech Corpus for Thai Language

This paper presents work on phonetically balanced (PB) and phonetically distributed (PD) sentence sets, which form part of the text prompts for speech recording in a Large Vocabulary Continuous Speech Recognition (LVCSR) corpus for the Thai language. First, a protocol for Thai phonetic transcription is described, along with some essential rules for phonetic correction after the grapheme-to-phoneme (G2P) process. PB and PD sentences are then selected iteratively, which avoids the tedious work of manually correcting the phones of all initial sentences. A standard text corpus, ORCHID, was chosen as the initial text. Analyses of several attributes of both the PB and PD sets are reported, including the numbers of words, syllables, monophones, and biphones, and the phone distribution. Finally, the selected PB set is partially compared with the American English TIMIT PB set (MIT-450) and the Japanese ATR 503 PB set.
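The abstract does not spell out the selection procedure, so the following is only a minimal sketch of one common approach to building a phonetically balanced set: greedy coverage of biphones. It is not taken from the paper. The function names `biphones` and `select_pb_sentences`, the toy phone symbols, and the choice of biphones as the unit to cover are all assumptions made for illustration.

```python
from typing import Dict, List, Set, Tuple

Biphone = Tuple[str, str]


def biphones(phones: List[str]) -> Set[Biphone]:
    """Return the set of adjacent phone pairs in one transcribed sentence."""
    return set(zip(phones, phones[1:]))


def select_pb_sentences(corpus: Dict[str, List[str]]) -> List[str]:
    """Greedily pick sentence ids until every biphone in the corpus is covered.

    `corpus` maps a sentence id to its phone sequence (hypothetical format).
    At each step the sentence covering the most still-uncovered biphones is
    chosen; ties go to the lexicographically smallest id.
    """
    units = {sid: biphones(phones) for sid, phones in corpus.items()}
    uncovered: Set[Biphone] = set().union(*units.values())
    selected: List[str] = []
    while uncovered:
        best = max(sorted(units), key=lambda sid: len(units[sid] & uncovered))
        gain = units[best] & uncovered
        if not gain:  # defensive stop; should not occur while uncovered is non-empty
            break
        selected.append(best)
        uncovered -= gain
    return selected


# Toy example with made-up phone symbols (not real Thai transcriptions).
toy_corpus = {
    "s1": ["k", "a", "n"],
    "s2": ["k", "a"],
    "s3": ["n", "a", "k", "a", "n"],
}
chosen = select_pb_sentences(toy_corpus)
```

In a workflow of this kind, only the sentences that end up in `chosen` would need their phone transcriptions checked by hand, which is the sort of saving the iterative procedure in the abstract aims for.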

Supphanat Kanokphara | Chai Wutiwiwatchai | Patcharika Cotsomrong | Sinaporn Suebvisai
