The 2005 AMI System for the Transcription of Speech in Meetings

In this paper we describe the 2005 AMI system for the transcription of speech in meetings, as used in the 2005 NIST Rich Transcription (RT05s) evaluations. The system was designed for participation in the speech-to-text part of the evaluations, in particular for the transcription of speech recorded with multiple distant microphones and with individual headset microphones. System performance was tested on both conference-room and lecture-style meetings. Although the input sources are processed using different front-ends, the recognition process is based on a unified system architecture. The system operates in multiple passes and makes use of state-of-the-art techniques such as discriminative training, vocal tract length normalisation, heteroscedastic linear discriminant analysis, speaker adaptation with maximum likelihood linear regression, and minimum word error rate decoding. We report system performance on the official development and test sets for the NIST RT05s evaluations. The system was jointly developed in less than 10 months by a multi-site team and was shown to achieve competitive performance.
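As one illustration of the front-end techniques mentioned above, vocal tract length normalisation (VTLN) is commonly realised as a piecewise-linear warping of the frequency axis before mel filterbank analysis, with the warp factor chosen per speaker by a likelihood grid search. The sketch below shows only the warping function itself; the breakpoint ratio and the warp-factor range are illustrative assumptions, not the actual values used in the AMI system.

```python
import numpy as np

def piecewise_linear_warp(freqs, alpha, f_nyq, f_break_ratio=0.8):
    """Warp frequencies by factor alpha below a breakpoint, then
    continue linearly so that the Nyquist frequency maps to itself.

    freqs         : array of frequencies in Hz
    alpha         : warp factor (e.g. searched over ~0.88..1.12 per speaker)
    f_nyq         : Nyquist frequency in Hz
    f_break_ratio : fraction of f_nyq where the second linear segment starts
                    (illustrative choice, not a standard constant)
    """
    freqs = np.asarray(freqs, dtype=float)
    f_break = f_break_ratio * f_nyq
    # Below the breakpoint: pure linear scaling by alpha.
    # Above it: linear segment from (f_break, alpha*f_break) to (f_nyq, f_nyq),
    # which keeps the warped axis within [0, f_nyq].
    return np.where(
        freqs <= f_break,
        alpha * freqs,
        alpha * f_break
        + (f_nyq - alpha * f_break) * (freqs - f_break) / (f_nyq - f_break),
    )
```

In a speaker-adaptive system, this warp would be applied to the mel filterbank centre frequencies, and alpha selected by maximising acoustic-model likelihood over a small discrete grid.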
