An experimental study on structural-MAP approaches to implementing very large vocabulary speech recognition systems for real-world tasks

In this paper we present an experimental study of structural Bayesian adaptation for handling potential mismatches between training and test conditions in real-world applications of our multilingual very large vocabulary speech recognition (VLVSR) system project, sponsored by the Ministry of Trade, Industry and Energy (MOTIE), Republic of Korea. The goal of the project is to construct a nationwide VLVSR cloud service platform for mobile applications. At such a large scale, besides system architecture design issues, robustness problems caused by mismatches in speakers, tasks, environments, and domains must also be considered carefully. We adopt adaptation techniques, in particular structural maximum a posteriori (MAP) adaptation, to reduce the accuracy degradation caused by these mismatches. As part of this ongoing project, we describe how structural MAP approaches can be used to adapt both the acoustic and language models of our VLVSR systems, and we provide experimental results demonstrating how adaptation can help bridge the performance gap between the current state of the art and deployable VLVSR systems.
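
As a rough illustration of the idea behind structural MAP, and not the system described in this paper, the sketch below adapts scalar Gaussian means along a small tree: each node's MAP estimate serves as the prior mean for its children, so a leaf with little or no adaptation data falls back on its parent. The function names, the single prior weight `tau`, and the toy tree are all hypothetical, and variances are assumed fixed.

```python
def map_mean(prior_mean, tau, frames):
    """MAP estimate of a Gaussian mean under a conjugate prior.

    tau acts as a pseudo-count: the larger it is, the more the
    estimate stays near prior_mean. (Assumed fixed variance.)
    """
    n = len(frames)
    return (tau * prior_mean + sum(frames)) / (tau + n)


def structural_map(root_mean, tau, tree):
    """Propagate MAP estimates from the root down a tree.

    tree is a list of (node, parent, frames) with parents listed
    before their children; frames at an internal node are assumed
    to be the pooled adaptation data of its descendants.
    """
    posterior = {None: root_mean}
    for node, parent, frames in tree:
        # The parent's posterior mean is used as this node's prior mean.
        posterior[node] = map_mean(posterior[parent], tau, frames)
    return posterior


# Hypothetical toy tree: one cluster with two leaves, one of them unseen.
toy_tree = [
    ("cluster", None, [3.0, 3.0]),
    ("leaf_a", "cluster", []),          # no adaptation data
    ("leaf_b", "cluster", [3.0, 3.0]),
]
adapted = structural_map(root_mean=0.0, tau=2.0, tree=toy_tree)
```

Here `leaf_a` has no data and therefore inherits the cluster-level estimate, while `leaf_b` moves further toward its own observations; this fallback behaviour is what makes hierarchical priors attractive when adaptation data are sparse.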
