Hidden Markov Acoustic Modeling With Bootstrap and Restructuring for Low-Resourced Languages

This paper proposes an acoustic modeling approach based on bootstrap and restructuring to deal with data sparsity for low-resourced languages. The goal of the approach is to improve the statistical reliability of acoustic modeling for automatic speech recognition (ASR) under the speed, memory, and response latency requirements of real-world applications. In this approach, randomized hidden Markov models (HMMs) estimated from the bootstrapped training data are aggregated for reliable sequence prediction. The aggregation leads to an HMM with superior prediction capability at the cost of a substantially larger size. For practical usage, the aggregated HMM is restructured by Gaussian clustering followed by model refinement. The restructuring aims to reduce the aggregated HMM to a desirable model size while keeping its performance close to that of the original aggregated HMM. To that end, various Gaussian clustering criteria and model refinement algorithms have been investigated in the full covariance model space before the conversion to the diagonal covariance model space in the last stage of the restructuring. Large vocabulary continuous speech recognition (LVCSR) experiments on Pashto and Dari have shown that acoustic models obtained by the proposed approach can yield better performance than the conventional training procedure with almost the same run-time memory consumption and decoding speed.
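
The bootstrap, aggregation, and restructuring steps described above can be illustrated with a toy sketch for the Gaussians of a single HMM state. It is not the paper's implementation: the function names are hypothetical, each bootstrap replicate is reduced to a single full-covariance Gaussian, the clustering criterion is assumed to be an entropy-based (log-determinant) merge cost, which is only one of many possible criteria, and the model refinement step is omitted.

```python
import numpy as np

def fit_gaussian(x):
    """Maximum-likelihood mean and full covariance of the rows of x."""
    mean = x.mean(axis=0)
    centered = x - mean
    return mean, centered.T @ centered / len(x)

def bootstrap_aggregate(x, n_replicates, rng):
    """Fit one Gaussian per bootstrap replicate and pool them
    into an equal-weight mixture (the 'aggregated' model)."""
    n = len(x)
    comps = []
    for _ in range(n_replicates):
        idx = rng.integers(0, n, size=n)  # resample frames with replacement
        mean, cov = fit_gaussian(x[idx])
        comps.append((1.0 / n_replicates, mean, cov))
    return comps

def merge(a, b):
    """Moment-matched merge of two weighted full-covariance Gaussians."""
    wa, ma, ca = a
    wb, mb, cb = b
    w = wa + wb
    m = (wa * ma + wb * mb) / w
    da, db = ma - m, mb - m
    c = (wa * (ca + np.outer(da, da)) + wb * (cb + np.outer(db, db))) / w
    return w, m, c

def merge_cost(a, b):
    """Assumed criterion: increase in weighted log-determinant after merging."""
    w, _, c = merge(a, b)
    logdet = lambda cov: np.linalg.slogdet(cov)[1]
    return 0.5 * (w * logdet(c) - a[0] * logdet(a[2]) - b[0] * logdet(b[2]))

def restructure(comps, target_size):
    """Greedily merge the cheapest pair until target_size Gaussians remain."""
    comps = list(comps)
    while len(comps) > target_size:
        pairs = [(merge_cost(comps[i], comps[j]), i, j)
                 for i in range(len(comps))
                 for j in range(i + 1, len(comps))]
        _, i, j = min(pairs)
        merged = merge(comps[i], comps[j])
        comps = [c for k, c in enumerate(comps) if k not in (i, j)] + [merged]
    return comps

def to_diagonal(comps):
    """Last stage: keep only the variances of each full covariance."""
    return [(w, m, np.diag(c)) for w, m, c in comps]
```

A call such as `to_diagonal(restructure(bootstrap_aggregate(x, 8, rng), 2))`, with `x` an array of feature frames and `rng = np.random.default_rng(0)`, would reduce eight aggregated Gaussians to two diagonal-covariance ones. Clustering is done on full covariances and the diagonal conversion comes last, mirroring the order described in the abstract.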
