HMM Mixtures (HMM2) for Robust Speech Recognition

State-of-the-art automatic speech recognition (ASR) techniques are typically based on hidden Markov models (HMMs) for the modeling of temporal sequences of feature vectors extracted from the speech signal. At the level of each HMM state, Gaussian mixture models (GMMs) or artificial neural networks (ANNs) are commonly used to model the state emission probabilities. However, both GMMs and ANNs are rather rigid, as they are incapable of adapting to variations inherent in the speech signal, such as inter- and intra-speaker variations. Moreover, these systems degrade severely under unmatched conditions, for example in the presence of environmental noise. Considerable research effort is currently devoted to overcoming these problems.

The principal objective of this thesis is to explore new approaches towards a more robust and adaptive modeling of speech. In this context, different aspects of the modeling of speech data with HMMs and GMMs are investigated, with particular attention given to the modeling of correlation. While correlation between different feature vectors (i.e., temporal correlation) is typically modeled by the HMM, correlation between feature vector components (e.g., correlation in frequency) is modeled by the GMM part of the model. This thesis starts by investigating two potential ways to improve the modeling of correlation: (1) shifting the modeling of temporal correlation towards GMMs, and (2) modeling the correlation within each feature vector by a particular type of HMM.

This leads to the development of a novel approach, referred to as HMM2, which is a major focus of this thesis. HMM2 is a particular mixture of hidden Markov models, in which the state emission probabilities of the temporal (primary) HMM are modeled through state-dependent, frequency-based (secondary) HMMs. Low-dimensional GMMs are used to model the state emission probabilities of the secondary HMM states. HMM2 can therefore be seen as a generalization of conventional HMMs, which it includes as a particular case.

HMM2 may have several advantages compared to standard systems. While the primary HMM performs time warping and time integration, the secondary HMM performs warping and integration along the frequency dimension of the speech signal. Frequency correlation is modeled through the secondary HMM topology. Owing to the implicit, non-linear, state-dependent spectral warping performed by the secondary HMM, HMM2 may be viewed as a dynamic extension of the multi-band approach. Moreover, this frequency warping property may result in better, more flexible modeling and parameter sharing.

After an investigation of theoretical and practical aspects of HMM2, encouraging recognition results are reported for speech degraded by additive noise. Owing to its spectral warping property, HMM2 is able to extract pertinent structural information from the speech signal, which is reflected in the trained model parameters.
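To make the layered architecture described above concrete, the following is a minimal, self-contained sketch (in Python with NumPy) of how the emission likelihood of a single primary-HMM state could be computed by a forward pass of a secondary frequency HMM over the components of one feature vector. The class name, parameter values, and single 1-D Gaussian per secondary state (rather than a low-dimensional GMM) are illustrative assumptions, not the thesis's actual implementation.

```python
import numpy as np

class SecondaryHMM:
    """Frequency HMM serving as the emission model of one primary-HMM state."""

    def __init__(self, means, variances, trans, init):
        self.means = np.asarray(means, dtype=float)      # one Gaussian mean per secondary state
        self.variances = np.asarray(variances, dtype=float)
        self.trans = np.asarray(trans, dtype=float)      # secondary-state transition matrix
        self.init = np.asarray(init, dtype=float)        # initial secondary-state distribution

    def _gauss(self, x):
        """Likelihood of one scalar component x under each secondary state's Gaussian."""
        return np.exp(-0.5 * (x - self.means) ** 2 / self.variances) / np.sqrt(
            2.0 * np.pi * self.variances)

    def emission_likelihood(self, vector):
        """Forward pass along the components (frequency bins) of one feature vector."""
        alpha = self.init * self._gauss(vector[0])
        for x in vector[1:]:
            alpha = (alpha @ self.trans) * self._gauss(x)
        return float(alpha.sum())                        # p(vector | primary state)

# Toy example: a 2-state frequency HMM scoring a 6-bin spectral vector.
state_model = SecondaryHMM(
    means=[0.2, 0.8], variances=[0.05, 0.05],
    trans=[[0.7, 0.3], [0.1, 0.9]], init=[0.9, 0.1])
print(state_model.emission_likelihood(np.array([0.15, 0.25, 0.7, 0.85, 0.8, 0.75])))
```

In a full system, one such secondary HMM would be attached to every primary-HMM state, and the primary HMM would combine these per-frame likelihoods across time in the usual way.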
Consequently, such an HMM2 system can also be used to explicitly extract structures of the speech signal, which can then be converted into a new kind of ASR features, referred to as HMM2 features. Frequency bands with similar characteristics are expected to be emitted by the same secondary HMM state; the warping along the frequency dimension of speech thus results in an adaptable, data-driven frequency segmentation. Since different secondary HMM states can be assumed to model spectral regions of high and low energy respectively, this segmentation may be related to formant structures. The application of HMM2 as a feature extractor is investigated, and it is shown that a system combining HMM2 features with conventional noise-robust features yields improved speech recognition robustness. Moreover, a comparison of HMM2 features with formant tracks shows comparable performance on a vowel classification task.
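As an illustration of how such a frequency segmentation might be obtained, the sketch below runs a Viterbi pass of a toy two-state frequency HMM along the components of a single spectral vector and returns the bin indices at which the secondary state changes; boundary positions of this kind are the sort of quantity that could serve as HMM2 features. All function names and numerical values are hypothetical and chosen only for illustration.

```python
import numpy as np

def gauss(x, means, variances):
    """Likelihood of a scalar x under each state's 1-D Gaussian."""
    return np.exp(-0.5 * (x - means) ** 2 / variances) / np.sqrt(2.0 * np.pi * variances)

def hmm2_frequency_segmentation(vector, means, variances, trans, init):
    """Viterbi along the components of one feature vector; returns the bin
    indices where the most likely secondary state changes."""
    means, variances = np.asarray(means, float), np.asarray(variances, float)
    log_trans, n_bins = np.log(np.asarray(trans, float)), len(vector)
    delta = np.log(np.asarray(init, float)) + np.log(gauss(vector[0], means, variances))
    backptr = np.zeros((n_bins, len(means)), dtype=int)
    for t in range(1, n_bins):
        scores = delta[:, None] + log_trans             # rows: from-state, cols: to-state
        backptr[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + np.log(gauss(vector[t], means, variances))
    path = [int(delta.argmax())]
    for t in range(n_bins - 1, 0, -1):                  # backtrack the best state sequence
        path.append(int(backptr[t, path[-1]]))
    path.reverse()
    return [t for t in range(1, n_bins) if path[t] != path[t - 1]]

# Toy 2-state "low/high energy" model applied to a 6-bin spectral vector:
boundaries = hmm2_frequency_segmentation(
    vector=np.array([0.15, 0.25, 0.7, 0.85, 0.8, 0.75]),
    means=[0.2, 0.8], variances=[0.05, 0.05],
    trans=[[0.7, 0.3], [0.1, 0.9]], init=[0.9, 0.1])
print(boundaries)   # with these values, [2]: the low-to-high energy transition bin
```

In this reading, the boundary indices track where the spectral energy level changes, which is why such a segmentation may correlate with formant structure.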
