The ESAT 2008 system for N-Best Dutch speech recognition benchmark

This paper describes the ESAT 2008 Broadcast News transcription system for the N-Best 2008 benchmark, developed in part to test the recent SPRAAK Speech Recognition Toolkit. The ESAT system was developed for the Southern Dutch Broadcast News subtask of N-Best using standard methods of modern speech recognition. A combination of improvements was made in commonly overlooked areas such as text normalization, pronunciation modeling, lexicon selection and morphological modeling, virtually solving the out-of-vocabulary (OOV) problem for Dutch by reducing the OOV rate to 0.06% on the N-Best development data and 0.23% on the evaluation data. Recognition experiments were run with several configurations, comparing one-pass vs. two-pass decoding, high-order vs. low-order n-gram models, different lexicon sizes and different types of morphological modeling. The system achieved a word error rate (WER) of 7.23% on the broadcast news development data and 20.3% on the much more difficult evaluation data of N-Best.
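The two figures of merit quoted above, OOV rate and WER, can be illustrated with a minimal sketch. It assumes the usual definitions: OOV rate as the fraction of running word tokens missing from the recognition lexicon, and WER as the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The function names `oov_rate` and `word_error_rate` and the toy Dutch sentences are hypothetical and are not taken from the ESAT system or from SPRAAK.

```python
def oov_rate(tokens, lexicon):
    """Fraction of running word tokens that are absent from the lexicon."""
    if not tokens:
        return 0.0
    missing = sum(1 for token in tokens if token not in lexicon)
    return missing / len(tokens)


def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    if not reference:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing in hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(reference)


# Toy data (illustrative only, not from the N-Best benchmark).
toy_lexicon = {"het", "nieuws", "van"}
toy_reference = "het nieuws van vandaag".split()
toy_hypothesis = "het nieuw van de dag".split()

print(f"OOV rate: {oov_rate(toy_reference, toy_lexicon):.2%}")
print(f"WER: {word_error_rate(toy_reference, toy_hypothesis):.2%}")
```

In this toy case the unseen word "vandaag" cannot be recognized correctly whatever the acoustic evidence, which is why lowering the OOV rate bounds part of the achievable WER.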
