Interpolation of acoustic models for speech recognition

Acoustic models for speech recognition are often trained on data coming from a variety of sources. The usual approach is to pool all of the available training data, treating it as a single training set. In this work, assuming that each source may have a different degree of relevance for a given target task, two techniques are proposed to weight subsets of the training data. The first is based on interpolation of the model probability densities, while the second is based on data weighting. A method to automatically select the mixture coefficients is also proposed. The best technique presented here outperformed unsupervised MAP adaptation and led to improvements in word accuracy (up to 6% relative) over the pooled model.
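The core idea of density interpolation, together with automatic selection of the mixture coefficients, can be sketched as follows. This is a minimal, hypothetical 1-D Gaussian illustration, not the paper's actual HMM/GMM implementation: each source model is reduced to a single Gaussian, and the interpolation weights are re-estimated by EM on held-out frames (a standard approach for interpolation weights; the paper's own selection method may differ).

```python
import numpy as np

def gaussian_pdf(x, mean, var):
    """1-D Gaussian density (stand-in for a per-source acoustic model)."""
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

def interpolated_density(x, models, weights):
    """Linear interpolation of source densities: p(x) = sum_k w_k * p_k(x)."""
    return sum(w * gaussian_pdf(x, m, v) for w, (m, v) in zip(weights, models))

def em_weights(frames, models, n_iter=50):
    """EM re-estimation of the mixture coefficients on held-out frames.

    E-step: posterior responsibility of each source model per frame.
    M-step: new weight = average responsibility over the frames.
    """
    K = len(models)
    w = np.full(K, 1.0 / K)                       # start from uniform weights
    dens = np.array([gaussian_pdf(frames, m, v) for (m, v) in models])  # (K, T)
    for _ in range(n_iter):
        post = w[:, None] * dens                  # unnormalised posteriors
        post /= post.sum(axis=0, keepdims=True)   # normalise over sources
        w = post.mean(axis=1)                     # M-step
    return w
```

For example, with two source models centred at 0 and 5 and held-out frames near 0, the EM procedure drives the first weight toward 1, so the interpolated density is dominated by the matched source.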
