Semi-continuous hidden Markov models for speech recognition

Hidden Markov models, which can be based on either discrete output probability distributions or continuous mixture output density functions, have proved to be among the most powerful statistical tools available for automatic speech recognition. This thesis proposes a semi-continuous hidden Markov model, a very general model that includes both the discrete and the continuous mixture hidden Markov models as special cases. It unifies vector quantisation, the discrete hidden Markov model, and the continuous mixture hidden Markov model. Under the assumption that each vector quantisation codeword can be represented by a continuous probability density function, the semi-continuous output probability is a combination of discrete, model-dependent weighting coefficients with these continuous codebook probability density functions. In comparison to the conventional continuous mixture hidden Markov model, the semi-continuous hidden Markov model can maintain the modelling ability of large-mixture probability density functions. In addition, the number of free parameters and the computational complexity can be reduced because all of the probability density functions are tied together in the codebook. The semi-continuous hidden Markov model thus provides a good solution to the conflict between detailed acoustic modelling and insufficient training data. In comparison to the conventional discrete hidden Markov model, robustness can be enhanced by using multiple codewords in deriving the semi-continuous output probability, and the vector quantisation codebook itself can be optimised together with the hidden Markov model parameters under the maximum likelihood criterion. Such unified modelling can substantially reduce the information lost by conventional vector quantisation.
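As an illustration only, the semi-continuous output probability described above can be sketched as a weighted sum over a shared codebook of densities, b_j(x) = sum_k c_jk f(x | v_k). The sketch assumes diagonal-covariance Gaussian codebook densities; the function names, the `top_m` argument (which keeps only the most likely codewords) and the toy numbers are hypothetical and are not taken from the thesis.

```python
import math

def gaussian_pdf(x, mean, var):
    """Diagonal-covariance Gaussian density at vector x (assumed density form)."""
    log_p = 0.0
    for xi, mi, vi in zip(x, mean, var):
        log_p += -0.5 * (math.log(2.0 * math.pi * vi) + (xi - mi) ** 2 / vi)
    return math.exp(log_p)

def semi_continuous_output_prob(x, weights, codebook, top_m=None):
    """Return sum_k c_jk * f(x | v_k) for one state j.

    weights  : discrete, state-dependent coefficients c_jk (one per codeword)
    codebook : list of (mean, var) pairs shared by every state
    top_m    : if given, sum only over the top_m most likely codewords
               (hypothetical option, not specified in the source)
    """
    densities = [gaussian_pdf(x, mean, var) for mean, var in codebook]
    indices = list(range(len(codebook)))
    if top_m is not None:
        indices = sorted(indices, key=lambda k: densities[k], reverse=True)[:top_m]
    return sum(weights[k] * densities[k] for k in indices)

# Toy two-codeword codebook in two dimensions (made-up values).
codebook = [([0.0, 0.0], [1.0, 1.0]), ([3.0, 3.0], [1.0, 1.0])]
state_weights = [0.7, 0.3]  # c_jk for a single state; they sum to 1
x = [0.5, -0.2]

print(semi_continuous_output_prob(x, state_weights, codebook))
print(semi_continuous_output_prob(x, state_weights, codebook, top_m=1))
```

Because the codebook is shared, the densities need to be evaluated only once per observation, and each state contributes only its own weight vector. With `top_m=1` and hard weights the sum collapses to a single weighted codeword, which is one way to see the discrete model as a special case.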
The semi-continuous hidden Markov model was evaluated in a range of speech recognition experiments, and the results clearly demonstrate that it offers improved recognition accuracy in comparison to both the discrete hidden Markov model and the continuous mixture hidden Markov model. It is concluded that the unified modelling theory is indeed powerful for modelling non-stationary stochastic processes with multi-modal probabilistic functions of Markov chains, and as such is very useful for automatic speech recognition.
