Paper information - LOTUS-BN: A Thai broadcast news corpus and its research applications

This paper describes the design and construction of the LOTUS-BN corpus, a Thai television broadcast news corpus. In addition to audio recordings and their transcriptions, the corpus includes detailed annotation of many interesting characteristics of broadcast news data, such as acoustic conditions, overlapping speech, news topics and named entities. LOTUS-BN is still an ongoing project with the goal of collecting 100 hours of speech. We report initial statistics from 60 hours of speech, which show that the LOTUS-BN corpus has a rich vocabulary of approximately 26,000 words, about one third of which are named entities. The corpus is therefore a good resource for developing an LVCSR system and for investigating named entity detection and recognition, in addition to broadcast news related applications. Research applications on these topics are also discussed.

Chai Wutiwiwatchai | Ananlada Chotimongkol | Nattanun Thatphithakkul | Patcharika Chootrakool | Kwanchiva Saykhum
