Prior-shared feature and model space speaker adaptation by consistently employing MAP estimation

This paper describes a speaker adaptation method that improves speech recognition performance regardless of the amount of adaptation data. To that end, we propose consistently employing maximum a posteriori (MAP)-based Bayesian estimation for both feature space normalization and model space adaptation. Specifically, constrained structural maximum a posteriori linear regression (CSMAPLR) is first performed in the feature space to compensate for speaker characteristics, and structural maximum a posteriori linear regression (SMAPLR) is then performed in the model space to capture the remaining speaker characteristics. A prior distribution stabilizes parameter estimation, especially when the amount of adaptation data is small. In the proposed method, CSMAPLR and SMAPLR are performed with the same acoustic model, so the dimension-dependent variations in the feature and model spaces can be similar. Because the prior distribution explains the dimension-dependent variations of the transformation matrix well, sharing the same prior distribution between CSMAPLR and SMAPLR allows parameter estimation to be appropriately regularized in both spaces. Experiments on large vocabulary continuous speech recognition using the Corpus of Spontaneous Japanese (CSJ) and the MIT OpenCourseWare corpus (MIT-OCW) confirm the effectiveness of the proposed method compared with conventional adaptation methods, both with and without speaker adaptive training.
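The stabilizing role of a shared prior can be illustrated with a deliberately simplified sketch. It is not the CSMAPLR/SMAPLR estimation described above, which works on HMM likelihoods with structured, hierarchical priors. It assumes a single feature dimension, a scalar affine transform `y ≈ a*x + b`, a squared-error criterion, and an isotropic Gaussian prior centred on the identity transform. The function name `map_affine`, the weight `tau`, and the toy data are all hypothetical.

```python
def map_affine(xs, ys, prior=(1.0, 0.0), tau=1.0):
    """Toy MAP-style fit of y ~ a*x + b (assumption: squared error, Gaussian prior).

    Minimizes sum_i (y_i - a*x_i - b)^2 + tau * ((a - a0)^2 + (b - b0)^2),
    where (a0, b0) = prior. tau = 0 gives the ordinary least-squares fit.
    """
    a0, b0 = prior
    n = len(xs)
    sx = sum(xs)
    sxx = sum(x * x for x in xs)
    sy = sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    # 2x2 normal equations of the penalized criterion.
    m11, m12, m22 = sxx + tau, sx, n + tau
    r1, r2 = sxy + tau * a0, sy + tau * b0
    det = m11 * m22 - m12 * m12
    if det == 0:
        raise ValueError("singular system: add data or use tau > 0")
    a = (r1 * m22 - m12 * r2) / det
    b = (m11 * r2 - m12 * r1) / det
    return a, b


# One prior (identity transform) and one weight, shared by both stages.
shared_prior = (1.0, 0.0)
tau = 5.0

# Hypothetical toy data: model means and the speaker frames aligned to them.
model_means = [0.0, 1.0, 2.0]
frames = [0.4, 1.5, 2.6]

# Stage 1 (feature space): move the frames toward the model means.
fa, fb = map_affine(frames, model_means, shared_prior, tau)
normalized = [fa * x + fb for x in frames]

# Stage 2 (model space): move the means toward what remains after stage 1.
ma, mb = map_affine(model_means, normalized, shared_prior, tau)
adapted_means = [ma * m + mb for m in model_means]
```

With no adaptation data the estimate falls back to the prior exactly, and as data accumulates the data terms dominate. This mirrors, in a much reduced form, why a prior helps most when adaptation data is scarce.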
