Paper information - Large vocabulary continuous speech recognition of Broadcast News - The Philips/RWTH approach

Abstract: Automatic speech recognition of real-life broadcast news (BN) data (Hub-4) has become a challenging research topic in recent years. This paper summarizes our key efforts to build a large vocabulary continuous speech recognition system for the heterogeneous BN task without incurring undue complexity or computational cost. These key efforts included:

• automatic segmentation of the audio signal into speech utterances;
• efficient one-pass trigram decoding using look-ahead techniques;
• optimal log-linear interpolation of a variety of acoustic and language models using discriminative model combination (DMC);
• handling short-range and weak longer-range correlations in natural speech and language by the use of phrases and of distance language models;
• improving the acoustic modeling by robust feature extraction, channel normalization, and adaptation techniques, as well as automatic script selection and verification.

The starting point of the system development was the Philips 64k-NAB word-internal triphone trigram system. On the speaker-independent but microphone-dependent NAB task (transcription of read newspaper texts) we obtained a word error rate of about 10%. Now, at the conclusion of the system development, we at Philips have arrived at a DMC-interpolated, phrase-based, crossword-pentaphone 4-gram system. This system transcribes BN data with an overall word error rate of about 17%.
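The log-linear interpolation mentioned in the abstract can be illustrated with a small sketch. It shows only the generic form, p(w) proportional to the product of p_i(w) raised to a weight lambda_i, followed by renormalization over the vocabulary. The function name, the toy distributions, and the hand-picked weights are all assumptions made for illustration; in DMC the weights would be trained discriminatively, which this sketch does not attempt.

```python
import math


def log_linear_interpolate(distributions, weights):
    """Combine distributions p_i(w) as p(w) ∝ prod_i p_i(w) ** lambda_i.

    `distributions` is a list of dicts mapping each word to a strictly
    positive probability; all dicts must share the same vocabulary.
    `weights` holds one interpolation weight (lambda_i) per distribution.
    Hypothetical helper, not taken from the paper.
    """
    if len(distributions) != len(weights):
        raise ValueError("need exactly one weight per distribution")
    vocab = set(distributions[0])
    if any(set(d) != vocab for d in distributions[1:]):
        raise ValueError("all distributions must cover the same vocabulary")

    # Weighted sum of log-probabilities: the unnormalized log score.
    log_scores = {
        w: sum(lam * math.log(d[w]) for d, lam in zip(distributions, weights))
        for w in vocab
    }

    # Normalize with log-sum-exp so the result sums to one.
    peak = max(log_scores.values())
    log_z = peak + math.log(sum(math.exp(s - peak) for s in log_scores.values()))
    return {w: math.exp(s - log_z) for w, s in log_scores.items()}


# Toy next-word distributions standing in for two language models
# (for example a trigram model and a distance model); values are made up.
trigram = {"news": 0.5, "views": 0.3, "shoes": 0.2}
distance_model = {"news": 0.6, "views": 0.3, "shoes": 0.1}

combined = log_linear_interpolate([trigram, distance_model], [0.7, 0.3])
```

Unlike linear interpolation, the weighted product is not automatically normalized, so the explicit normalization step over the vocabulary is part of the model rather than an optional cleanup.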
