Portability Issues for Speech Recognition Technologies

Although there has been regular improvement in speech recognition technology over the past decade, speech recognition is far from being a solved problem. Most recognition systems are tuned to a particular task and porting the system to a new task (or language) still requires substantial investment of time and money, as well as expertise. Todays state-of-the-art systems rely on the availability of large amounts of manually transcribed data for acoustic model training and large normalized text corpora for language model training. Obtaining such data is both time-consuming and expensive, requiring trained human annotators with substantial amounts of supervision.In this paper we address issues in speech recognizer portability and activities aimed at developing generic core speech recognition technology, in order to reduce the manual effort required for system development. Three main axes are pursued: assessing the genericity of wide domain models by evaluating performance under several tasks; investigating techniques for lightly supervised acoustic model training; and exploring transparent methods for adapting generic models to a specific task so as to achieve a higher degree of genericity.

[1]  R. G. Leonard,et al.  A database for speaker-independent digit recognition , 1984, ICASSP.

[2]  Janet M. Baker,et al.  The Design for the Wall Street Journal-based CSR Corpus , 1992, HLT.

[3]  Mark Liberman,et al.  Transcriber: Development and use of a tool for assisting speech corpora production , 2001, Speech Commun..

[4]  Jean-Luc Gauvain,et al.  Fast decoding for indexation of broadcast data , 2000, INTERSPEECH.

[5]  Jean-Luc Gauvain,et al.  Transcribing Broadcast News: The LIMSI Nov96 Hub4 System , 1997 .

[6]  Mark Liberman,et al.  THE TDT-2 TEXT AND SPEECH CORPUS , 1999 .

[7]  Alexander I. Rudnicky,et al.  Expanding the Scope of the ATIS Task: The ATIS-3 Corpus , 1994, HLT.

[8]  Alexander H. Waibel,et al.  Unsupervised training of a speech recognizer: recent experiments , 1999, EUROSPEECH.

[9]  George Zavaliagkos,et al.  Utilizing untranscribed training data to improve perfomance , 1998, LREC.

[10]  Jonathan G. Fiscus,et al.  1998 Broadcast News Benchmark Test Results: English and Non-English Word Error Rate Performance Measures , 1998 .

[11]  Jean-Luc Gauvain,et al.  Lightly Supervised Acoustic Model Training , 2000 .

[12]  Philip C. Woodland,et al.  Maximum likelihood linear regression for speaker adaptation of continuous density hidden Markov models , 1995, Comput. Speech Lang..

[13]  Ellen M. Voorhees,et al.  1998 TREC-7 Spoken Document Retrieval Track Overview and Results , 1998 .

[14]  Ellen M. Voorhees,et al.  The TREC Spoken Document Retrieval Track: A Success Story , 2000, TREC.

[15]  Jean-Luc Gauvain,et al.  Language modeling for broadcast news transcription , 1999, EUROSPEECH.

[16]  Mitch Weintraub,et al.  The Hub and Spoke Paradigm for CSR Evaluation , 1994, HLT.

[17]  Chin-Hui Lee,et al.  Maximum a posteriori estimation for multivariate Gaussian mixture observations of Markov chains , 1994, IEEE Trans. Speech Audio Process..