Maximum likelihood and minimum classification error factor analysis for automatic speech recognition

Hidden Markov models (HMMs) for automatic speech recognition rely on high-dimensional feature vectors to summarize the short-time properties of speech. Correlations between features can arise when the speech signal is nonstationary or corrupted by noise. We investigate how to model these correlations using factor analysis, a statistical method for dimensionality reduction. Factor analysis uses a small number of parameters to model the covariance structure of high-dimensional data. These parameters can be chosen in two ways: (1) to maximize the likelihood of observed speech signals, or (2) to minimize the number of classification errors. We derive an expectation-maximization (EM) algorithm for maximum likelihood estimation and a gradient descent algorithm for improved class discrimination. Speech recognizers are evaluated on two tasks: one with a small vocabulary (connected alpha-digits) and one with a medium-sized vocabulary (New Jersey town names). We find that modeling feature correlations by factor analysis leads to significantly increased likelihoods and word accuracies. Moreover, the rate of improvement with model size often exceeds that observed in conventional HMMs.
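To make the maximum-likelihood branch of the abstract concrete, the sketch below implements EM for a single factor analyzer in NumPy. It is a generic illustration of the technique the paper builds on (modeling a covariance matrix as ΛΛᵀ + Ψ with diagonal Ψ), not the paper's HMM-embedded estimator; the function name, dimensions, and convergence settings are illustrative assumptions.

```python
import numpy as np

def factor_analysis_em(X, k, n_iter=100, seed=0):
    """Maximum-likelihood factor analysis via EM (a generic sketch).

    Model: x ~ N(0, L L^T + Psi), with L a (d, k) loading matrix and
    Psi a diagonal noise covariance. X is (N, d), assumed zero-mean.
    Returns the estimated L and the diagonal of Psi.
    """
    rng = np.random.default_rng(seed)
    N, d = X.shape
    S = X.T @ X / N                         # sample covariance
    L = 0.1 * rng.standard_normal((d, k))   # small random init
    Psi = np.diag(S).copy()                 # start from the diagonal model

    for _ in range(n_iter):
        # E-step: posterior over latent factors z given each x.
        # G = (I + L^T Psi^{-1} L)^{-1}; E[z|x] = G L^T Psi^{-1} x.
        Pinv = 1.0 / Psi
        G = np.linalg.inv(np.eye(k) + (L.T * Pinv) @ L)   # (k, k)
        B = G @ (L.T * Pinv)                              # (k, d): E[z|x] = B x
        Ez = X @ B.T                                      # (N, k) posterior means
        Ezz = N * G + Ez.T @ Ez                           # sum_n E[z z^T | x_n]

        # M-step: closed-form updates for L and the diagonal Psi.
        L = (X.T @ Ez) @ np.linalg.inv(Ezz)
        Psi = np.diag(S) - np.sum(L * (B @ S).T, axis=1)  # diag(S - L B S)
        Psi = np.maximum(Psi, 1e-6)                       # keep Psi positive

    return L, Psi
```

With k latent factors, the model needs only d·k + d covariance parameters instead of d(d+1)/2, which is the parameter saving the abstract refers to; each EM iteration is guaranteed not to decrease the likelihood.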
