From HMM's to segment models: a unified view of stochastic modeling for speech recognition

Many alternative models have been proposed to address some of the shortcomings of the hidden Markov model (HMM), which is currently the most popular approach to speech recognition. In particular, a variety of models that could be broadly classified as segment models have been described for representing a variable-length sequence of observation vectors in speech recognition applications. Since there are many aspects in common between these approaches, including the general recognition and training problems, it is useful to consider them in a unified framework. The paper describes a general stochastic model that encompasses most of the models proposed in the literature, pointing out similarities of the models in terms of correlation and parameter tying assumptions, and drawing analogies between segment models and HMMs. In addition, we summarize experimental results assessing different modeling assumptions and point out remaining open questions.
