Paper information - A Mixture of Recurrent Neural Networks for Speaker Normalisation

A Mixture of Recurrent Neural Networks for Speaker Normalisation

In spite of recent advances in automatic speech recognition, the performance of state-of-the-art speech recognisers fluctuates depending on the speaker. Speaker normalisation aims to reduce the differences between the acoustic space of a new speaker and the training acoustic space of a given speech recogniser, thereby improving performance. Normalisation is based on an acoustic feature transformation, which is estimated from a small amount of speech signal. This paper introduces a mixture of recurrent neural networks as an effective regression technique to approach the problem. A suitable Viterbi-based time alignment procedure is proposed for generating the adaptation set. The mixture is compared with linear regression and single-model connectionist approaches. Speaker-dependent and speaker-independent continuous speech recognition experiments with a large vocabulary, using Hidden Markov Models, are presented. Results show that the mixture improves recognition performance, yielding a 21% relative reduction of the word error rate, which is comparable with that obtained with model-adaptation approaches.
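As a reading aid for the "21% relative reduction" figure, the short sketch below shows how a relative word error rate reduction is conventionally computed. The function name and the two input error rates are hypothetical, chosen only to illustrate the arithmetic; the abstract does not report the absolute error rates.

```python
def relative_wer_reduction(baseline_wer: float, normalised_wer: float) -> float:
    """Return the fraction of the baseline word error rate removed.

    Both arguments are word error rates in the same unit (e.g. percent).
    """
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - normalised_wer) / baseline_wer


# Hypothetical error rates (not taken from the paper): a baseline of 10.0%
# and a normalised-system rate of 7.9% give a relative reduction of about 0.21.
example = relative_wer_reduction(baseline_wer=10.0, normalised_wer=7.9)
```

A relative reduction is measured against the baseline error rate, so the same relative figure can correspond to very different absolute improvements.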

Diego Giuliani | Edmondo Trentin
