Experiments on Adaptation Methods to Improve Acoustic Modeling for French Speech Recognition

To improve the performance of Automatic Speech Recognition (ASR) systems, the models must be retrained to better match the speaker's voice characteristics, the environmental and channel conditions, or the context of the task. In this project we focus on the mismatch between the acoustic features used to train the model and the vocal characteristics of the end user of the system. To overcome this mismatch, we use speaker adaptation techniques. Constrained Maximum Likelihood Linear Regression (cMLLR) model adaptation yields a significant performance improvement, while linear Vocal Tract Length Normalization (lVTLN) provides fast adaptation. With unsupervised cMLLR adaptation, we achieve a relative reduction of approximately 9.44% in the word error rate. We also compare our ASR system with the Google ASR and show that, using adaptation methods, we exceed its performance.
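As a minimal illustration of two quantities mentioned above, the following Python sketch computes a relative word error rate reduction and applies an affine transform to one feature vector, which is the form a cMLLR transform takes when viewed in feature space. The function names and all numeric values are hypothetical and chosen only for illustration; they are not the paper's results, and estimating the transform itself (the maximum-likelihood step) is not shown.

```python
def relative_wer_reduction(baseline_wer, adapted_wer):
    """Relative WER reduction in percent: 100 * (baseline - adapted) / baseline."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return 100.0 * (baseline_wer - adapted_wer) / baseline_wer


def apply_affine_transform(A, b, x):
    """Apply x' = A x + b to a single feature vector.

    A is a square matrix given as a list of rows, b and x are lists.
    In cMLLR, A and b would be estimated per speaker; here they are
    simply passed in (hypothetical values).
    """
    return [
        sum(a_ij * x_j for a_ij, x_j in zip(row, x)) + b_i
        for row, b_i in zip(A, b)
    ]


if __name__ == "__main__":
    # Hypothetical WER values, not taken from the paper.
    print(relative_wer_reduction(20.0, 18.0))

    # Hypothetical 2-dimensional transform and feature vector.
    A = [[1.0, 0.0], [0.0, 2.0]]
    b = [0.5, -1.0]
    print(apply_affine_transform(A, b, [3.0, 4.0]))
```

The relative measure is used because it lets gains be compared across systems whose baseline error rates differ.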
