Integrating imperfect transcripts into speech recognition systems for building high-quality corpora

Abstract: Training state-of-the-art automatic speech recognition (ASR) systems requires very large, relevant corpora. Such databases are costly to produce, which remains a major obstacle to developing speech-enabled applications in particular contexts (e.g. low-density languages or specialized domains). On the other hand, large amounts of data are available as news prompts, movie subtitles, scripts, etc. Using such data as a training corpus could provide a low-cost solution to the acoustic model estimation problem. Unfortunately, these prior transcripts seldom match the content of the speech signal exactly, and they lack temporal information. This paper tackles the improvement of prompt-based speech corpora by addressing both problems. We propose a method that locates accurate transcript segments in the speech signal and automatically corrects erroneous or missing transcript around these segments. The method relies on a new decoding strategy in which the search algorithm is driven by the imperfect transcription of the input utterances. Experiments are conducted on French, using the ESTER database and a set of recordings (with associated prompts) from RTBF (Radio Television Belge Francophone). The results demonstrate the effectiveness of the proposed approach in terms of both error correction and text-to-speech alignment.
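As a rough illustration of what locating accurate transcript segments can mean, the sketch below aligns an imperfect prompt with a recognizer hypothesis at the word level and keeps the runs of words on which both agree. This is not the driven-decoding method proposed in the paper, which acts inside the search algorithm; it is a simpler post-hoc text alignment shown only to make the idea concrete. The function name `find_islands`, the `min_len` threshold, and the example sentences are assumptions introduced for this sketch.

```python
from difflib import SequenceMatcher


def find_islands(prompt_words, hyp_words, min_len=3):
    """Return (prompt_index, hyp_index, length) for each run of at least
    min_len consecutive words shared by the prompt and the hypothesis.

    Hypothetical helper for illustration: a plain word-level alignment,
    not the transcript-driven decoding described in the abstract.
    """
    # autojunk=False disables difflib's heuristic that ignores very
    # frequent items, which could otherwise drop common words.
    matcher = SequenceMatcher(None, prompt_words, hyp_words, autojunk=False)
    return [
        (block.a, block.b, block.size)
        for block in matcher.get_matching_blocks()
        if block.size >= min_len
    ]


if __name__ == "__main__":
    # Invented example: the prompt contains an extra word ("that") and a
    # different final word than what the recognizer produced.
    prompt = "the president said that the economy is growing fast".split()
    hypothesis = "uh the president said the economy is growing quickly".split()
    for p_start, h_start, size in find_islands(prompt, hypothesis):
        print(p_start, h_start, " ".join(prompt[p_start:p_start + size]))
```

The agreed-upon runs can serve as anchors: the words between two anchors are the regions where the prompt is likely wrong or incomplete, and if the hypothesis carries word timings, the anchors also give approximate time stamps for the prompt.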
