Rapid speaker adaptation in eigenvoice space

This paper describes a new model-based speaker adaptation algorithm called the eigenvoice approach. The approach constrains the adapted model to be a linear combination of a small number of basis vectors obtained offline from a set of reference speakers, which greatly reduces the number of free parameters to be estimated from adaptation data. These "eigenvoice" basis vectors are mutually orthogonal and are guaranteed to represent the most important components of variation between the reference speakers. Experimental results on a small-vocabulary task (letter recognition) show that the approach yields major improvements in performance from tiny amounts of adaptation data. For instance, we obtained a 16% relative improvement in error rate with one letter of supervised adaptation data, and a 26% relative improvement with four letters. After comparing the eigenvoice approach with other speaker adaptation algorithms, the paper concludes with a discussion of future work.
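The core idea in the abstract, an adapted model constrained to a linear combination of a few orthogonal basis vectors learned from reference speakers, can be illustrated with a minimal NumPy sketch. Everything here is an assumption made for illustration: the data are synthetic, the function names `eigenvoice_basis` and `adapt` are hypothetical, the basis is obtained by plain PCA (via SVD) on per-speaker parameter vectors, and the weights are fitted by ordinary least squares on a partially observed vector. The least-squares fit is a simple stand-in and should not be read as the estimator used in the paper.

```python
import numpy as np


def eigenvoice_basis(supervectors, k):
    """Return (mean, basis); basis rows are the top-k principal directions.

    supervectors: array of shape (n_speakers, dim), one row per reference speaker.
    """
    mean = supervectors.mean(axis=0)
    centered = supervectors - mean
    # Rows of vt are orthonormal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return mean, vt[:k]


def adapt(mean, basis, observed, mask):
    """Fit k weights by least squares using only the dimensions where mask is True.

    Stand-in estimator (an assumption): the adapted vector is
    mean + weights @ basis, so only k numbers are estimated.
    """
    a = basis[:, mask].T               # shape (n_observed, k)
    b = observed[mask] - mean[mask]    # shape (n_observed,)
    weights, *_ = np.linalg.lstsq(a, b, rcond=None)
    return mean + weights @ basis, weights


# Synthetic reference speakers that vary along k hidden directions plus small noise.
rng = np.random.default_rng(0)
n_speakers, dim, k = 20, 50, 3
true_dirs = rng.normal(size=(k, dim))
refs = rng.normal(size=(n_speakers, k)) @ true_dirs
refs += 0.01 * rng.normal(size=(n_speakers, dim))

mean, basis = eigenvoice_basis(refs, k)

# A new speaker for whom only 10 of the 50 parameters are observed,
# mimicking a very small amount of adaptation data.
new_speaker = rng.normal(size=k) @ true_dirs
mask = np.zeros(dim, dtype=bool)
mask[:10] = True

adapted, weights = adapt(mean, basis, new_speaker, mask)
print("weights:", weights)
print("error of reference mean:", np.linalg.norm(mean - new_speaker))
print("error of adapted vector:", np.linalg.norm(adapted - new_speaker))
```

Because only `k` weights are estimated instead of all `dim` parameters, a handful of observed dimensions is enough to place the new speaker in the learned subspace, which is the intuition behind adapting from very little data.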
