Progress in automatic meeting transcription

In this paper we report recent developments on the meeting transcription task, a large-vocabulary conversational speech recognition task. Previous experiments showed that this is a very challenging task, with a word error rate (WER) of about 50% using existing recognizers. The difficulty comes mostly from the highly disfluent, conversational nature of meetings and from the lack of domain-specific training data. To address the first problem, we used our Switchboard (SWB) system, a conversational telephone speech recognizer, to recognize wide-band meeting data; to address the second, we leveraged the large amount of Broadcast News (BN) data to build a robust system. This paper focuses on two experiments in the BN system development: model combination and HMM topology/duration modeling. Model combination can be done at various stages of recognition: post-processing schemes such as ROVER can lead to significant improvements, and to reduce computation we tried model combination at the acoustic score level. We also show the importance of temporal constraints in decoding and present some HMM topology/duration modeling experiments. Finally, we review the meeting browser system and the meeting room setup.
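For readers unfamiliar with the metric quoted above, the following is a minimal sketch of how WER is conventionally computed: the word-level edit distance (substitutions, deletions, and insertions) between a reference transcript and a recognizer hypothesis, divided by the number of reference words. The function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration; this is not the scoring tool used in the paper, and real scoring pipelines typically also normalize text before alignment.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER = (substitutions + deletions + insertions) / reference length.

    Illustrative helper (hypothetical name); tokenizes on whitespace only.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion of a reference word
                curr[j - 1] + 1,                  # insertion of a hypothesis word
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    ref_text = "we will meet on monday"
    hyp_text = "we meet on a monday"
    # One deletion ("will") and one insertion ("a") against five reference words.
    print(f"WER = {word_error_rate(ref_text, hyp_text):.2f}")
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.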
