Acoustic Model Training with Detecting Transcription Errors in the Training Data

As the target of Automatic Speech Recognition (ASR) has shifted from clean read speech to spontaneous conversational speech, orthographic transcripts of such speech are needed to train acoustic models (AMs). However, manually transcribing spontaneous speech word by word is expensive and slow. We propose a framework for training an AM from easy-to-make rough transcripts, in which fillers and small word fragments are not precisely transcribed and some transcription errors remain. By examining phone durations in the forced alignment between the rough transcripts and the utterances, we can automatically detect the erroneous parts of the rough transcripts. A preliminary experiment showed that these erroneous parts can be detected with moderately high recall and precision. Through ASR experiments on conversational telephone speech, we confirmed that automatic detection improved the performance of AMs trained with both conventional maximum-likelihood (ML) criteria and state-of-the-art boosted MMI criteria.
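The core idea of flagging transcript regions by abnormal phone durations in the forced alignment can be sketched as follows. This is a minimal illustration, not the paper's exact method: the alignment data format, the per-phone duration statistics, and the z-score threshold are all hypothetical assumptions introduced for the example.

```python
# Hedged sketch: flag likely transcription errors using phone durations
# from a forced alignment. A phone whose aligned duration is a large
# outlier against its expected duration suggests the rough transcript
# does not match the audio at that point.

# Hypothetical per-phone (mean, std) durations in seconds (assumed values).
PHONE_STATS = {
    "AH": (0.06, 0.02),
    "K": (0.05, 0.015),
    "S": (0.09, 0.03),
    "T": (0.05, 0.02),
}
DEFAULT_STATS = (0.07, 0.03)  # fallback for phones without statistics

def flag_suspect_words(alignment, z_thresh=3.0):
    """alignment: list of (word, [(phone, duration_sec), ...]) tuples,
    as produced by some forced aligner (format assumed here).
    Returns words containing a phone whose duration deviates from its
    mean by more than z_thresh standard deviations."""
    suspects = []
    for word, phones in alignment:
        for phone, dur in phones:
            mean, std = PHONE_STATS.get(phone, DEFAULT_STATS)
            if abs(dur - mean) > z_thresh * std:
                suspects.append(word)
                break  # one outlier phone is enough to flag the word
    return suspects

# Toy alignment: the vowel in "miss" is implausibly long, hinting that
# the rough transcript is wrong there.
align = [
    ("cat", [("K", 0.05), ("AH", 0.06), ("T", 0.04)]),
    ("miss", [("M", 0.06), ("IH", 0.40), ("S", 0.08)]),
]
print(flag_suspect_words(align))  # → ['miss']
```

In practice the flagged regions would be excluded from (or down-weighted in) AM training rather than discarded wholesale, and the duration statistics would come from a reliably transcribed subset of the corpus.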
