The AMI System for the Transcription of Speech in Meetings

This paper describes the AMI transcription system for speech in meetings, developed in collaboration by five research groups. The system includes generic techniques such as discriminative and speaker adaptive training, vocal tract length normalisation, heteroscedastic linear discriminant analysis, maximum likelihood linear regression, and phone posterior based features, as well as techniques specifically designed for meeting data. These include segmentation and cross-talk suppression, beam-forming, domain adaptation, Web-data collection, and channel adaptive training. The system yields a word error rate more than 20% lower (relative) than our previous system and was used in the NIST RT'06 evaluations, where it was found to give competitive performance.
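
To make the array front end concrete, the sketch below illustrates one common approach to the kind of beam-forming the abstract refers to for distant-microphone meeting recordings: inter-channel time delays are estimated with GCC-PHAT and the aligned channels are averaged (delay-and-sum). This is a minimal illustration under that assumption, not the AMI system's actual implementation; the function names (gcc_phat_delay, delay_and_sum) are purely illustrative.

```python
import numpy as np

def gcc_phat_delay(sig, ref, fs, max_tau=None):
    """Estimate the delay (in samples) of `sig` relative to `ref`
    using the phase-transform (GCC-PHAT) weighting."""
    n = sig.shape[0] + ref.shape[0]
    SIG = np.fft.rfft(sig, n=n)
    REF = np.fft.rfft(ref, n=n)
    cross = SIG * np.conj(REF)
    cross /= np.abs(cross) + 1e-12            # PHAT weighting: keep phase only
    cc = np.fft.irfft(cross, n=n)
    max_shift = n // 2 if max_tau is None else int(max_tau * fs)
    # Reorder so index 0 corresponds to lag -max_shift
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    return np.argmax(np.abs(cc)) - max_shift

def delay_and_sum(channels, fs, ref_idx=0):
    """Align each channel to a reference channel by its estimated
    delay and average, producing a single enhanced output signal."""
    ref = channels[ref_idx]
    out = np.zeros_like(ref, dtype=float)
    for ch in channels:
        d = gcc_phat_delay(ch, ref, fs)
        out += np.roll(ch.astype(float), -d)   # advance lagging channels
    return out / len(channels)
```

A production meeting front end would typically add per-channel weighting (e.g. filter-and-sum or superdirective weights) and noise post-filtering, but the align-and-average step above is the core idea behind combining the distant microphone channels before recognition.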
