Large vocabulary continuous speech recognition of Broadcast News - The Philips/RWTH approach

Abstract Automatic speech recognition of real-live broadcast news (BN) data (Hub-4) has become a challenging research topic in recent years. This paper summarizes our key efforts to build a large vocabulary continuous speech recognition system for the heterogenous BN task without inducing undesired complexity and computational resources. These key efforts included: • automatic segmentation of the audio signal into speech utterances; • efficient one-pass trigram decoding using look-ahead techniques; • optimal log-linear interpolation of a variety of acoustic and language models using discriminative model combination (DMC); • handling short-range and weak longer-range correlations in natural speech and language by the use of phrases and of distance-language models; • improving the acoustic modeling by a robust feature extraction, channel normalization, adaptation techniques as well as automatic script selection and verification. The starting point of the system development was the Philips 64k-NAB word-internal triphone trigram system. On the speaker-independent but microphone-dependent NAB-task (transcription of read newspaper texts) we obtained a word error rate of about 10%. Now, at the conclusion of the system development, we have arrived at Philips at an DMC-interpolated phrase-based crossword-pentaphone 4-gram system. This system transcribes BN data with an overall word error rate of about 17%.

[1]  F. Kubala,et al.  Automatic Speaker Clustering , 1997 .

[2]  Hermann Ney,et al.  Large vocabulary continuous speech recognition of Wall Street Journal data , 1994, Proceedings of ICASSP '94. IEEE International Conference on Acoustics, Speech and Signal Processing.

[3]  Reinhard Kneser,et al.  Statistical language modeling using a variable context length , 1996, Proceeding of Fourth International Conference on Spoken Language Processing. ICSLP '96.

[4]  Francis Kubala,et al.  Modeling Those F-Conditions - Or Not , 1997 .

[5]  Jonathan G. Fiscus,et al.  A post-processing system to yield reduced word error rates: Recognizer Output Voting Error Reduction (ROVER) , 1997, 1997 IEEE Workshop on Automatic Speech Recognition and Understanding Proceedings.

[6]  H Hermansky,et al.  Perceptual linear predictive (PLP) analysis of speech. , 1990, The Journal of the Acoustical Society of America.

[7]  Dietrich Klakow,et al.  Log-linear interpolation of language models , 1998, ICSLP.

[8]  Steve J. Young,et al.  A One Pass Decoder Design For Large Vocabulary Recognition , 1994, HLT.

[9]  Reinhold Häb-Umbach,et al.  A study on speaker normalization using vocal tract normalization and speaker adaptive training , 1998, Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP '98 (Cat. No.98CH36181).

[10]  A. B.,et al.  SPEECH COMMUNICATION , 2001 .

[11]  Sirko Molau,et al.  Automatic verification of broadcast news transcriptions , 1999, EUROSPEECH.

[12]  Reinhold Häb-Umbach,et al.  An investigation of cepstral parameterisations for large vocabulary speech recognition , 1999, EUROSPEECH.

[13]  Dietrich Klakow Language-model optimization by mapping of corpora , 1998, Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP '98 (Cat. No.98CH36181).

[14]  Hermann Ney,et al.  A word graph algorithm for large vocabulary continuous speech recognition , 1994, Comput. Speech Lang..

[15]  Hermann Ney,et al.  Improvements in beam search for 10000-word continuous-speech recognition , 1994, IEEE Trans. Speech Audio Process..

[16]  Mark J. F. Gales,et al.  Broadcast news transcription using HTK , 1997, 1997 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[17]  Hermann Ney,et al.  Large vocabulary continuous speech recognition using word graphs , 1995, 1995 International Conference on Acoustics, Speech, and Signal Processing.

[18]  Peter Beyerlein,et al.  Modelling and decoding of crossword context dependent phones in the Philips large vocabulary continuous speech recognition system , 1997, EUROSPEECH.

[19]  Steve Young,et al.  Segment generation and clustering in the HTK broadcast news transcription system , 1998 .

[20]  Hermann Ney,et al.  Improvements in beam search , 1994, ICSLP.

[21]  Andreas Wendemuth,et al.  Automatic Transcription of English Broadcast News , 1998 .

[22]  Mei-Yuh Hwang,et al.  Improvements on the pronunciation prefix tree search organization , 1996, 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing Conference Proceedings.

[23]  Li Lee,et al.  Speaker normalization using efficient frequency warping procedures , 1996, 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing Conference Proceedings.

[24]  Ronald Rosenfeld,et al.  Adaptive Statistical Language Modeling; A Maximum Entropy Approach , 1994 .

[25]  Hermann Ney,et al.  Language-model look-ahead for large vocabulary speech recognition , 1996, Proceeding of Fourth International Conference on Spoken Language Processing. ICSLP '96.

[26]  Xavier L. Aubert,et al.  One pass cross word decoding for large vocabularies based on a lexical tree search organization , 1999, EUROSPEECH.

[27]  John S. D. Mason,et al.  On the limitations of cepstral features in noise , 1994, Proceedings of ICASSP '94. IEEE International Conference on Acoustics, Speech and Signal Processing.

[28]  H. Ney,et al.  Improvements in beam search for 10000-word continuous speech recognition , 1992, [Proceedings] ICASSP-92: 1992 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[29]  J. Darroch,et al.  Generalized Iterative Scaling for Log-Linear Models , 1972 .

[30]  S. Chen,et al.  Speaker, Environment and Channel Change Detection and Clustering via the Bayesian Information Criterion , 1998 .

[31]  Hermann Ney,et al.  On the Probabilistic Interpretation of Neural Network Classifiers and Discriminative Training Criteria , 1995, IEEE Trans. Pattern Anal. Mach. Intell..

[32]  Andreas Wendemuth,et al.  The philips/RWTH system for transcription of broadcast news , 1999, EUROSPEECH.

[33]  Wolfgang Wahlster,et al.  Verbmobil: Foundations of Speech-to-Speech Translation , 2000, Artificial Intelligence.

[34]  R. Schwartz,et al.  THE 1997 BBN BYBLOS SYSTEM APPLIED TO BROADCAST NEWS TRANSCRIPTION , 1998 .

[35]  Hermann Ney,et al.  A comparison of time conditioned and word conditioned search techniques for large vocabulary speech recognition , 1996, Proceeding of Fourth International Conference on Spoken Language Processing. ICSLP '96.

[36]  Peter Beyerlein,et al.  Speaker adaptation in the Philips system for large vocabulary continuous speech recognition , 1997, 1997 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[37]  Jochen Peters,et al.  Capturing Long Range Correlations Using Log-Linear Language Models , 2000 .

[38]  Peter Beyerlein,et al.  Discriminative model combination , 1998, Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP '98 (Cat. No.98CH36181).

[39]  Jj Odell,et al.  The Use of Context in Large Vocabulary Speech Recognition , 1995 .

[40]  Steve J. Young,et al.  Large vocabulary continuous speech recognition using HTK , 1994, Proceedings of ICASSP '94. IEEE International Conference on Acoustics, Speech and Signal Processing.