Speech recognition in adverse environments: a probabilistic approach

In this thesis I advocate a probabilistic view of robust speech recognition. I discuss the classification of distorted features using an optimal classifier, and I show how the generation of noisy speech can be represented as a generative graphical probability model. My aim is to build a conceptual framework that provides a unified understanding of robust speech recognition and, to some extent, bridges the gap between a purely signal-processing viewpoint and the pattern-classification or decoding viewpoint. The most tangible contribution of this thesis is the introduction of the Algonquin method for robust speech recognition. It exemplifies the probabilistic method and encompasses a number of novel ideas. For example, it uses a probability distribution to describe the relationship between clean speech, noise, channel, and the resulting noisy speech. It employs a variational approach to find an approximation to the joint posterior distribution, which can be used to restore the distorted observations. It also allows the parameters of the environment to be estimated using a generalized EM method. Another important contribution of this thesis is a new paradigm for robust speech recognition, which I call uncertainty decoding. This paradigm follows naturally from the standard way of performing inference in the graphical probability model that describes noisy speech generation.
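
The abstract does not spell out the distribution relating clean speech, noise, channel, and noisy speech. As a minimal sketch, assume the common log-spectral approximation in which powers add and phase cross-terms are ignored, so that the noisy log-power y is scattered around log(exp(x + h) + exp(n)) with Gaussian error. The function names and the variance value below are hypothetical and chosen only for illustration; they are not taken from the thesis.

```python
import math


def noisy_log_spectrum(x, n, h=0.0):
    """Mean noisy log-power for clean speech x, noise n, and channel h.

    Assumption: powers add, i.e. exp(y) = exp(x + h) + exp(n),
    with all quantities expressed as log-power in one frequency band.
    """
    a, b = x + h, n
    # Numerically stable log(exp(a) + exp(b)).
    return max(a, b) + math.log1p(math.exp(-abs(a - b)))


def log_likelihood(y, x, n, h=0.0, var=0.01):
    """Gaussian log-density of an observed y around the mean above.

    The variance is an illustrative placeholder, not a fitted value.
    """
    mean = noisy_log_spectrum(x, n, h)
    return -0.5 * (math.log(2.0 * math.pi * var) + (y - mean) ** 2 / var)


if __name__ == "__main__":
    # Speech 20 units above the noise floor: the noise barely changes y.
    print(noisy_log_spectrum(10.0, -10.0))
    # Equal speech and noise power: y rises by log(2) above either one.
    print(noisy_log_spectrum(0.0, 0.0))
```

Under this assumed model, inferring x, n, and h from y is nonlinear, which is the kind of setting in which an approximate posterior such as the variational one mentioned above becomes useful.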
