The paper presents recent work on using consensus networks to improve lightly supervised acoustic model training for the LIMSI Mandarin broadcast news (BN) system. Lightly supervised acoustic model training has attracted growing interest, since it can substantially reduce the development costs of speech recognition systems. Compared with supervised training on accurate transcriptions, the key problem in lightly supervised training is making the approximate transcripts as close as possible to manually produced detailed ones, i.e., finding a proper way to provide the supervision information. Previous work using a language model to provide supervision has been quite successful. This paper extends that method with a new way to obtain the supervision information during training. Experiments are carried out on the TDT4 Mandarin audio corpus and its associated closed captions. After the training data are automatically recognized, the closed captions are aligned with a consensus network derived from the hypothesis lattices. As with closed-caption filtering, this method can remove speech segments whose automatic transcripts contain errors, but it can also recover errors in the hypothesis when the correct words are present in the lattice. Experimental results show that, compared with simply training on all of the data, consensus-network-based lightly supervised acoustic model training yields a small reduction in the character error rate on the DARPA/NIST RT'03 development and evaluation data.
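To make the alignment step concrete, here is a minimal sketch in Python of how closed captions might be checked against a consensus network. It assumes the network is a sequence of confusion bins, each a mapping from candidate words to posterior probabilities, and that bins and caption words are already aligned one-to-one; a real system would use a dynamic-programming alignment over the network, and the function name, data layout, and mismatch threshold below are illustrative assumptions, not the LIMSI implementation.

    from typing import Dict, List

    # Hypothetical layout: one confusion bin per word position,
    # each bin mapping candidate words to posterior probabilities.
    ConsensusNetwork = List[Dict[str, float]]

    def align_captions(network: ConsensusNetwork,
                       captions: List[str],
                       max_mismatch_rate: float = 0.2):
        """Derive a supervision transcript from a consensus network.

        Prefer the caption word whenever the lattice supports it (this
        can recover recognition errors); otherwise keep the most
        probable hypothesis and count a mismatch. Segments whose
        mismatch rate is too high are rejected, as in closed-caption
        filtering.
        """
        transcript, mismatches = [], 0
        for bin_probs, cap_word in zip(network, captions):
            if cap_word in bin_probs:
                transcript.append(cap_word)   # lattice confirms the caption
            else:
                transcript.append(max(bin_probs, key=bin_probs.get))
                mismatches += 1               # caption word not in the lattice
        keep = mismatches / max(len(captions), 1) <= max_mismatch_rate
        return transcript, keep

For example, align_captions([{"北": 0.6, "备": 0.3}, {"京": 0.9}], ["北", "京"]) returns (["北", "京"], True): both caption words are confirmed by the lattice, so the segment is kept for training.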