A Mixture of Recurrent Neural Networks for Speaker Normalisation

In spite of recent advances in automatic speech recognition, the performance of state-of-the-art speech recognisers still varies from speaker to speaker. Speaker normalisation aims to reduce the differences between the acoustic space of a new speaker and the training acoustic space of a given speech recogniser, thereby improving performance. Normalisation is based on an acoustic feature transformation that is estimated from a small amount of speech. This paper introduces a mixture of recurrent neural networks as an effective regression technique for this problem. A suitable Viterbi-based time alignment procedure is proposed for generating the adaptation set. The mixture is compared with linear regression and single-model connectionist approaches. Speaker-dependent and speaker-independent large-vocabulary continuous speech recognition experiments, using Hidden Markov Models, are presented. Results show that the mixture improves recognition performance, yielding a 21% relative reduction in word error rate, which is comparable with that obtained with model-adaptation approaches.
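To make the idea of a mixture of recurrent regressors concrete, the following is a minimal sketch only. It assumes scalar acoustic features, one-unit recurrent experts, and a softmax gate that depends on the current frame. The names `RecurrentExpert`, `mixture_normalise`, and `gate_weights`, and all weight values, are hypothetical. The abstract does not specify the architecture, the gating scheme, or the training procedure, so this is not the method used in the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


class RecurrentExpert:
    """Toy scalar recurrent regressor (hypothetical, for illustration only)."""

    def __init__(self, w_in, w_rec, w_out, bias):
        self.w_in = w_in
        self.w_rec = w_rec
        self.w_out = w_out
        self.bias = bias

    def run(self, frames):
        """Map a sequence of input frames to a sequence of outputs."""
        hidden = 0.0
        outputs = []
        for x in frames:
            # The hidden state carries context from previous frames.
            hidden = math.tanh(self.w_in * x + self.w_rec * hidden)
            outputs.append(self.w_out * hidden + self.bias)
        return outputs


def mixture_normalise(frames, experts, gate_weights):
    """Blend expert outputs frame by frame with input-dependent gates.

    gate_weights holds one (weight, bias) pair per expert (an assumption).
    """
    per_expert = [expert.run(frames) for expert in experts]
    mixed = []
    for t, x in enumerate(frames):
        gates = softmax([w * x + b for (w, b) in gate_weights])
        mixed.append(sum(g * out[t] for g, out in zip(gates, per_expert)))
    return mixed


# Example usage with made-up weights and a three-frame input sequence.
experts = [RecurrentExpert(1.0, 0.5, 1.0, 0.0),
           RecurrentExpert(0.8, -0.3, 1.2, 0.1)]
gate_weights = [(2.0, 0.0), (-2.0, 0.0)]
normalised = mixture_normalise([0.5, -0.2, 0.1], experts, gate_weights)
```

In this sketch each expert produces its own estimate of the normalised frame, and the gate decides how much each estimate contributes at every time step. In a real system the weights would be estimated from the time-aligned adaptation set rather than fixed by hand.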
