Segment-based stochastic models of spectral dynamics for continuous speech recognition

This dissertation addresses the problem of modeling the joint time-spectral structure of speech for recognition. Four areas are covered in this work: segment modeling, estimation, recognition search algorithms, and extension to a more general class of models. A unified view of the acoustic models that are currently used in speech recognition is presented; the research then focuses on segment-based models, which provide a better framework for modeling the intrasegmental statistical dependencies than conventional hidden Markov models (HMMs). The validity of a linearity assumption for modeling the intrasegmental statistical dependencies is first checked, and it is shown that the assumption inherent to HMMs, that observations are conditionally independent given the underlying state sequence, is inaccurate. Based on these results, linear models are chosen for the distribution of the observations within a segment of speech. Motivated by the original work on the stochastic segment model, a dynamical system segment model is proposed for continuous speech recognition. Training of this model is equivalent to the maximum likelihood identification of a stochastic linear system, and a simple alternative to the traditional approach is developed. This procedure is based on the Expectation-Maximization algorithm and is analogous to the Baum-Welch algorithm for HMMs, since the dynamical system segment model can be thought of as a continuous-state HMM. Recognition involves computing the probability of the innovations given by Kalman filtering. The high computational cost of segment-based models is dealt with by the introduction of fast recognition search algorithms as alternatives to the typical Dynamic Programming search. A Split-and-Merge segmentation algorithm is developed that achieves a significant reduction in computation with no loss in recognition performance.
Finally, the models are extended to the family of embedded segment models that are better suited for capturing the hierarchical structure of speech and modeling intersegmental statistical dependencies. Experimental results are based on speaker-independent phoneme recognition using the TIMIT database, and represent the best context-independent phoneme recognition performance reported on this task. In addition, the proposed dynamical system segment model is the first that removes the output independence assumption.
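
As an illustration of the recognition step described above (scoring a segment by the probability of its Kalman filter innovations), the following is a minimal sketch rather than the dissertation's implementation. It assumes a time-invariant linear Gaussian state-space model, x[k+1] = F x[k] + w[k], y[k] = H x[k] + v[k], with process noise covariance Q and observation noise covariance R; the function name and all parameter names are hypothetical.

```python
import numpy as np

def innovations_log_likelihood(obs, F, H, Q, R, x0, P0):
    """Log-likelihood of a segment `obs` (T x d array) under the assumed model
    x[k+1] = F x[k] + w[k],  y[k] = H x[k] + v[k],
    with w ~ N(0, Q), v ~ N(0, R) and prior x[0] ~ N(x0, P0).
    All names here are illustrative assumptions."""
    x, P = x0.copy(), P0.copy()
    d = obs.shape[1]
    total = 0.0
    for y in obs:
        # Innovation: the part of y not predicted from past observations.
        e = y - H @ x
        S = H @ P @ H.T + R  # innovation covariance
        # Gaussian log-density of the innovation.
        _, logdet = np.linalg.slogdet(S)
        total += -0.5 * (d * np.log(2.0 * np.pi) + logdet
                         + e @ np.linalg.solve(S, e))
        # Measurement update.
        K = P @ H.T @ np.linalg.inv(S)
        x = x + K @ e
        P = P - K @ H @ P
        # Time update (prediction for the next frame).
        x = F @ x
        P = F @ P @ F.T + Q
    return total
```

Under these assumptions, a segment could be classified by evaluating this score with each phone model's parameters and choosing the model with the highest value.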

[36]  Dimitri P. Bertsekas,et al.  Dynamic Programming: Deterministic and Stochastic Models , 1987 .

[37]  Theodosios Pavlidis,et al.  Picture Segmentation by a Tree Traversal Algorithm , 1976, JACM.

[38]  John Makhoul,et al.  Context-dependent modeling for acoustic-phonetic recognition of continuous speech , 1985, ICASSP '85. IEEE International Conference on Acoustics, Speech, and Signal Processing.

[39]  Richard Kronland-Martinet,et al.  Analysis of Sound Patterns through Wavelet transforms , 1987, Int. J. Pattern Recognit. Artif. Intell..

[40]  Slava M. Katz,et al.  Estimation of probabilities from sparse data for the language model component of a speech recognizer , 1987, IEEE Trans. Acoust. Speech Signal Process..

[41]  L. Deng,et al.  Modeling microsegments of stop consonants in a hidden Markov model based word recognizer , 1990 .

[42]  George R. Doddington,et al.  Frame-specific statistical features for speaker independent speech recognition , 1986, IEEE Trans. Acoust. Speech Signal Process..

[43]  L. Baum,et al.  A Maximization Technique Occurring in the Statistical Analysis of Probabilistic Functions of Markov Chains , 1970 .

[44]  Gilbert Strang,et al.  Wavelets and Dilation Equations: A Brief Introduction , 1989, SIAM Rev..

[45]  Mari Ostendorf,et al.  The stochastic segment model for continuous speech recognition , 1991, [1991] Conference Record of the Twenty-Fifth Asilomar Conference on Signals, Systems & Computers.

[46]  Shozo Makino,et al.  Recognition of phonemes using time-spectrum pattern , 1986, Speech Commun..

[47]  S. M. Peeling,et al.  The use of variable frame rate analysis in speech recognition , 1991 .

[48]  Alan S. Willsky,et al.  Kalman filtering and Riccati equations for multiscale processes , 1990, 29th IEEE Conference on Decision and Control.

[49]  A. Kumar,et al.  Derivative computations for the log likelihood function , 1982 .

[50]  J. Makhoul,et al.  Vector quantization in speech coding , 1985, Proceedings of the IEEE.

[51]  Mari Ostendorf,et al.  Fast algorithms for phone classification and recognition using segment-based models , 1992, IEEE Trans. Signal Process..

[52]  F. Jelinek,et al.  Continuous speech recognition by statistical methods , 1976, Proceedings of the IEEE.

[53]  John G. Proakis,et al.  Probability, random variables and stochastic processes , 1985, IEEE Trans. Acoust. Speech Signal Process..

[54]  Mari Ostendorf,et al.  A stochastic segment model for phoneme-based continuous speech recognition , 1989, IEEE Trans. Acoust. Speech Signal Process..

[55]  Victor W. Zue,et al.  Phonetic classification using multi-layer perceptrons , 1990, International Conference on Acoustics, Speech, and Signal Processing.

[56]  Alan V. Oppenheim,et al.  All-pole modeling of degraded speech , 1978 .

[57]  J. Friedman,et al.  Estimating Optimal Transformations for Multiple Regression and Correlation. , 1985 .

[58]  Geoffrey E. Hinton,et al.  Phoneme recognition using time-delay neural networks , 1989, IEEE Trans. Acoust. Speech Signal Process..

[59]  Karl Johan Åström,et al.  Numerical Identification of Linear Dynamic Systems from Normal Operating Records , 1965 .

[60]  R. Okafor Maximum likelihood estimation from incomplete data , 1987 .

[61]  Mari Ostendorf,et al.  A Dynamical System Approach to Continuous Speech Recognition , 1991, HLT.

[62]  Theodosios Pavlidis,et al.  Segmentation of Plane Curves , 1974, IEEE Transactions on Computers.

[63]  Helen Meng,et al.  Signal representation comparison for phonetic classification , 1991, [Proceedings] ICASSP 91: 1991 International Conference on Acoustics, Speech, and Signal Processing.

[64]  James Glass,et al.  Acoustic segmentation and phonetic classification in the SUMMIT system , 1988 .

[65]  James K. Baker,et al.  On the interaction between true source, training, and testing language models , 1991, [Proceedings] ICASSP 91: 1991 International Conference on Acoustics, Speech, and Signal Processing.

[66]  Y. Chien,et al.  Pattern classification and scene analysis , 1974 .

[67]  George Zavaliagkos,et al.  Continuous speech recognition using segmental neural nets , 1992, [Proceedings 1992] IJCNN International Joint Conference on Neural Networks.

[68]  Lennart Ljung,et al.  System Identification: Theory for the User , 1987 .

[69]  E. F. Velez,et al.  Transient analysis of speech signals using the Wigner time-frequency representation , 1989, International Conference on Acoustics, Speech, and Signal Processing,.

[70]  Peter F. Brown,et al.  The acoustic-modeling problem in automatic speech recognition , 1987 .

[71]  Dimitri Kanevsky,et al.  Matrix fast match: a fast method for identifying a short list of candidate words for decoding , 1989, International Conference on Acoustics, Speech, and Signal Processing,.

[72]  J. Baker,et al.  The DRAGON system--An overview , 1975 .

[73]  Neri Merhav,et al.  A Bayesian classification approach with application to speech recognition , 1991, IEEE Trans. Signal Process..

[74]  Achi Brandt,et al.  Multi-level approaches to discrete-state and stochastic problems , 1986 .

[75]  N. Sandell,et al.  MAXIMUM LIKELIHOOD IDENTIFICATION OF STATE SPACE MODELS FOR LINEAR DYNAMIC SYSTEMS , 1978 .