Recent improvements in the CU Sonic ASR system for noisy speech: the SPINE task

We report on recent improvements to the University of Colorado system for the DARPA/NRL Speech in Noisy Environments (SPINE) task. In particular, we describe our efforts to improve acoustic and language modeling for the task, and we investigate methods for unsupervised speaker and environment adaptation from limited data. We show that maximum a posteriori linear regression (MAPLR) adaptation outperforms both single and multiple regression class maximum likelihood linear regression (MLLR) on the SPINE task. Our current SPINE system uses the Sonic speech recognition engine developed at the University of Colorado. The system achieves a word error rate of 31.5% on the SPINE-2 evaluation data, a 16% relative reduction in word error rate compared to our previous SPINE-2 system fielded in the November 2001 DARPA/NRL evaluation.
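As a quick check on how the two reported figures relate, the sketch below computes a relative word error rate reduction and back-solves the baseline implied by a 31.5% word error rate and a 16% relative reduction. The function names are illustrative and not part of the Sonic system, and the baseline it prints is derived from the rounded 16% figure, so it should be read as approximate rather than as a reported result.

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Return the WER reduction as a fraction of the baseline WER."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer


def implied_baseline_wer(new_wer: float, relative_reduction: float) -> float:
    """Back-solve the baseline WER from the new WER and a relative reduction."""
    if not 0 <= relative_reduction < 1:
        raise ValueError("relative_reduction must be in [0, 1)")
    return new_wer / (1.0 - relative_reduction)


# Figures taken from the abstract; the 16% value is rounded there.
NEW_WER = 31.5
RELATIVE_REDUCTION = 0.16

# Derived quantity (an assumption that the rounded figures are exact).
baseline = implied_baseline_wer(NEW_WER, RELATIVE_REDUCTION)
absolute_drop = baseline - NEW_WER

print(f"implied baseline WER: {baseline:.1f}%")
print(f"absolute reduction:   {absolute_drop:.1f} points")
print(f"relative reduction:   {relative_wer_reduction(baseline, NEW_WER):.0%}")
```

The distinction matters when comparing systems: the relative reduction is measured against the baseline error rate, so it is always larger than the absolute drop in percentage points whenever the baseline is below 100%.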
