Paper Information - Automatic Phonemic Labeling and Segmentation of Spoken Dutch

Automatic Phonemic Labeling and Segmentation of Spoken Dutch

The CGN corpus (Corpus Gesproken Nederlands, the Spoken Dutch Corpus) is a large speech corpus of contemporary Dutch as spoken in Belgium (3.3 million words) and in the Netherlands (5.6 million words). Because of its size, manual phonemic annotation was limited to 10% of the data, and automatic systems were used to complement the manual annotations. This paper describes the automatic generation of the phonemic annotations and the corresponding segmentations. First, we detail the processes used to generate possible pronunciations for each sentence and to select the most likely one. Next, we identify the remaining difficulties in handling the CGN data and explain how we solved them. We conclude with an evaluation of the quality of the resulting transcriptions and segmentations.

Patrick Wambacq | Dirk Van Compernolle | Kris Demuynck | Tom Laureys

[1] Kris Demuynck, et al. Automatic generation of phonetic transcriptions for large speech corpora, 2002, INTERSPEECH.

[2] Alan W. Black, et al. Letter to sound rules for accented lexicon compression, 1998, ICSLP.

[3] G. Booij. The Phonology of Dutch, 1995.

[4] Jean-Pierre Martens, et al. Word Segmentation in the Spoken Dutch Corpus, 2002, LREC.

[5] Dirk Van Compernolle, et al. The phonological rules of Dutch, 1995.

[6] Susan Stewart, et al. Letter on Sound, 1998.

[7] F. V. Eynde. for the Spoken Dutch Corpus, 2000.

[8] Hermann Ney, et al. Confidence measures for large vocabulary continuous speech recognition, 2001, IEEE Trans. Speech Audio Process.

[9] Kris Demuynck, et al. Extracting, modelling and combining information in speech recognition, 2001.