Imperfect transcript driven speech recognition

In many cases, textual information can be associated with speech signals: movie subtitles, theater scripts, broadcast news summaries, etc. Such information can be considered an approximate transcript, since it rarely corresponds to the exact words uttered. The goal of this work is to use this kind of information to improve the performance of an automatic speech recognition (ASR) system. Multiple applications are possible: following a play with closed captions aligned to the voice signal (while accounting for performer variations) to help deaf people, watching a movie in another language using aligned and corrected closed captions, etc. We propose in this paper a method combining a linguistic analysis of the imperfect transcripts with a dynamic synchronization of these transcripts inside the search algorithm. The proposed technique is based on language model adaptation and on-line synchronization of the search algorithm. Experiments are carried out on an extract of the ESTER evaluation campaign database [3], using the LIA broadcast news system. The results show that the transcript-driven system significantly outperforms both the original recognizer and the imperfect transcript itself.
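To illustrate the kind of dynamic synchronization involved, the sketch below aligns an ASR hypothesis to an imperfect transcript with a dynamic-programming word alignment in the spirit of dynamic time warping [4]. This is only a minimal illustration, not the paper's actual implementation: the function name `align_cost` and the unit insertion/deletion/substitution costs are assumptions for the example.

```python
def align_cost(hypothesis, transcript):
    """Minimal DP word alignment (edit-distance style), illustrating
    how an ASR hypothesis can be synchronized with an imperfect
    transcript. Unit costs are an assumption for this sketch."""
    n, m = len(hypothesis), len(transcript)
    # cost[i][j]: minimal cost of aligning the first i hypothesis words
    # with the first j transcript words
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i  # all hypothesis words unmatched
    for j in range(1, m + 1):
        cost[0][j] = j  # all transcript words unmatched
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if hypothesis[i - 1] == transcript[j - 1] else 1
            cost[i][j] = min(
                cost[i - 1][j - 1] + sub,  # match or substitution
                cost[i - 1][j] + 1,        # word missing from transcript
                cost[i][j - 1] + 1,        # extra word in transcript
            )
    return cost[n][m]
```

A low alignment cost over a window of the transcript indicates where the recognizer currently is in the approximate text, which is the information an on-line synchronization mechanism needs.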

[1] Georges Linarès et al., "Scalable language model look-ahead for LVCSR," INTERSPEECH, 2005.

[2] John D. Lafferty et al., "Cheating with imperfect transcripts," Proceedings of the Fourth International Conference on Spoken Language Processing (ICSLP), 1996.

[3] Guillaume Gravier et al., "The ESTER phase II evaluation campaign for the rich transcription of French broadcast news," INTERSPEECH, 2005.

[4] Donald J. Berndt et al., "Using Dynamic Time Warping to Find Patterns in Time Series," KDD Workshop, 1994.

[5] Frédéric Béchet et al., "Le système de transcription du LIA pour ESTER-2005," 2005.

[6] Georges Linarès et al., "Phoneme Lattice Based A* Search Algorithm for Speech Recognition," TSD, 2002.

[7] Chih-Wei Huang, "Automatic Closed Caption Alignment Based on Speech Recognition Transcripts," 2003.

[8] Alexander G. Hauptmann et al., "Improving acoustic models with captioned multimedia speech," Proceedings of the IEEE International Conference on Multimedia Computing and Systems, 1999.

[9] Jean-Luc Gauvain et al., "Unsupervised acoustic model training," IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2002.

[10] Pedro J. Moreno et al., "A recursive algorithm for the forced alignment of very long audio segments," ICSLP, 1998.

[11] Richard M. Stern et al., "The 1996 Hub-4 Sphinx-3 System," 1997.