Accent adaptation using Subspace Gaussian Mixture Models

This paper investigates the use of Subspace Gaussian Mixture Models (SGMMs) for adapting acoustic models to different accents of English in speech recognition. SGMMs comprise globally shared and state-specific parameters, which allows acoustic parameters to be tied efficiently in various ways. Experimental results indicate that well-chosen sharing of acoustic model parameters in SGMMs significantly outperforms adapted systems based on conventional HMM/GMMs. Furthermore, SGMMs reach the target acoustic models quickly with small amounts of adaptation data. Experiments on the US and UK English versions of the Wall Street Journal (WSJ) corpora show that SGMMs yield approximately 20% and 8% relative improvements over conventional HMM/GMMs for speaker-independent and speaker-adapted acoustic models, respectively. Finally, we demonstrate that SGMMs adapted with only 1.5 hours of data can reach the performance of HMM/GMMs trained on 18 hours.
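For context, a minimal sketch of the standard SGMM state model on which this kind of parameter sharing rests, following Povey et al.'s formulation; the notation and the presence of sub-states here are assumptions for illustration, not details taken from this paper. Each HMM state j is represented by low-dimensional vectors, and the Gaussians are derived from globally shared projections:

    p(\mathbf{x} \mid j) = \sum_{m=1}^{M_j} c_{jm} \sum_{i=1}^{I} w_{jmi}\, \mathcal{N}\!\left(\mathbf{x};\, \boldsymbol{\mu}_{jmi},\, \boldsymbol{\Sigma}_i\right),
    \qquad
    \boldsymbol{\mu}_{jmi} = \mathbf{M}_i \mathbf{v}_{jm},
    \qquad
    w_{jmi} = \frac{\exp\!\left(\mathbf{w}_i^{\top} \mathbf{v}_{jm}\right)}{\sum_{i'=1}^{I} \exp\!\left(\mathbf{w}_{i'}^{\top} \mathbf{v}_{jm}\right)}.

In this formulation the projections M_i, the weight vectors w_i, and the covariances \Sigma_i are globally shared, while only the state- or sub-state-specific vectors v_{jm} and weights c_{jm} vary per state. Re-estimating only the low-dimensional state-specific parameters (and optionally the shared parameters) on accented data is what makes adaptation with small amounts of data plausible, as the abstract reports.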
