Nonlinear feature transformations for noise robust speech recognition

Robustness against external noise is an important requirement for automatic speech recognition (ASR) systems when it comes to deploying them in practical applications. This thesis proposes and evaluates new feature-based approaches for improving ASR noise robustness. These approaches are based on nonlinear transformations that, when applied to the spectrum or to the features, aim to emphasize the parts of the speech that are relatively more invariant to noise and/or deemphasize the parts that are more sensitive to noise.

Spectral peaks constitute the high signal-to-noise-ratio regions of speech. Thus, an efficient parameterization of only the components around the peak locations is expected to improve noise robustness. Evaluating this requires estimating the peak locations. Two methods are proposed in this thesis for the peak estimation task: 1) a frequency-based dynamic programming (DP) algorithm, which uses the spectral slope values of a single time frame, and 2) an HMM/ANN-based algorithm, which uses distinct time-frequency (TF) patterns in the spectrogram (thus imposing temporal constraints during peak estimation). Because the distinct TF patterns are learned in an unsupervised manner, the HMM/ANN-based algorithm is sensitive to energy fluctuations in the TF patterns, which is not the case with the frequency-based DP algorithm. For an efficient parameterization of the spectral components around the peak locations, parameters describing the activity pattern (energy surface) within local TF patches around the spectral peaks are computed and used as features. These features, referred to as spectro-temporal activity pattern (STAP) features, show improved noise robustness; however, they are inferior to the standard features on clean speech. The main reason for this is the complete masking of the non-peak regions of the spectrum, which also carry information significant for clean speech recognition.

This leads to the development of a new approach that uses a soft-masking procedure instead of discarding the non-peak spectral components completely. In this approach, referred to as the phase autocorrelation (PAC) approach, noise robustness is addressed in the autocorrelation domain (the time-domain Fourier equivalent of the power spectral domain). It uses the phase (i.e., angle) variation of the signal vector over time as a measure of correlation, as opposed to regular autocorrelation, which uses the dot product. This alternative measure of autocorrelation is referred to as PAC, and is motivated by the fact that the angle is less disturbed by additive disturbances than the dot product. Interestingly, the use of PAC has the effect of emphasizing the peaks and smoothing out the valleys in the spectral domain, without explicitly estimating the peak locations. PAC features exhibit improved noise robustness; however, even this soft-masking strategy tends to degrade clean speech recognition performance.
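To make the PAC idea concrete, the sketch below contrasts regular autocorrelation with a phase-based variant. This is a minimal illustration under stated assumptions, not the thesis implementation: it assumes circular shifts of the frame (so the shifted vector keeps the same norm) and takes the angle arccos(R(k)/R(0)) as the correlation measure; any additional scaling, windowing, or cepstral post-processing used in the thesis is omitted.

```python
import numpy as np

def regular_autocorrelation(x):
    """Circular autocorrelation: dot product of the frame with its shifted versions."""
    return np.array([np.dot(x, np.roll(x, k)) for k in range(len(x))])

def phase_autocorrelation(x, eps=1e-12):
    """PAC sketch: angle between the frame vector and its circularly shifted versions.

    With circular shifts both vectors have the same norm, so
    cos(theta_k) = R(k) / R(0); the angle itself, rather than the dot
    product, is taken as the correlation measure.
    """
    r = regular_autocorrelation(x)
    cos_theta = np.clip(r / (r[0] + eps), -1.0, 1.0)  # guard against rounding outside [-1, 1]
    return np.arccos(cos_theta)

if __name__ == "__main__":
    # Synthetic frame: a tone plus a little noise.
    rng = np.random.default_rng(0)
    frame = np.sin(2 * np.pi * 0.05 * np.arange(256)) + 0.1 * rng.standard_normal(256)
    pac = phase_autocorrelation(frame)
    # A PAC-derived spectrum could then be obtained by taking the DFT of `pac`
    # and continuing with conventional cepstral processing.
```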
This points to the fact that externally designed transformations, which do not take complete account of the underlying complexity of the speech signal, may not be able to improve robustness without hurting clean speech recognition. A better approach in this case is to learn the transformation from the speech data itself in a data-driven manner, trading off improved noise robustness against keeping clean-speech performance intact. An existing data-driven approach called TANDEM is analyzed to validate this.

In the TANDEM approach, a multi-layer perceptron (MLP) is used to perform a data-driven transformation of the input features; it learns the transformation by being trained in a supervised, discriminative mode with phoneme labels as the output classes. Such training makes the MLP perform a nonlinear discriminant analysis in the input feature space and thus makes it learn a transformation that projects the input features onto a subspace of maximum class-discriminatory information. This projection is able to suppress noise-related variability while keeping the speech-discriminatory information intact. An experimental evaluation of the TANDEM approach shows that it is effective in improving noise robustness. Interestingly, the TANDEM approach is able to further improve the noise robustness of the STAP and PAC features, and also to improve their clean speech recognition performance. The analysis of the noise robustness of TANDEM has also led to another interesting aspect of it, namely its use as an integration tool for adaptively combining multiple feature streams. The validity of the various noise-robust approaches developed in this thesis is demonstrated by evaluating them on the OGI Numbers95 database corrupted with noises from Noisex92, as well as on the Aurora-2 database. A combination of the robust features developed in this thesis with standard features, in a TANDEM framework, results in a system that is reasonably robust in all conditions.
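As a rough illustration of the TANDEM pipeline described above, the sketch below trains an MLP discriminatively with phoneme labels, takes log phoneme posteriors as the data-driven transformation of the input features, and decorrelates them with PCA/KLT before they would be handed to a conventional HMM back-end. The data here is random stand-in data, and the hidden-layer size, number of phoneme classes, and PCA dimensionality are illustrative assumptions, not the settings used in the thesis.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.decomposition import PCA

# Stand-in data: rows are per-frame acoustic feature vectors (e.g. PLP, STAP,
# or PAC features with temporal context); labels are frame-level phoneme ids.
rng = np.random.default_rng(0)
train_feats = rng.standard_normal((5000, 39))
train_phones = rng.integers(0, 40, size=5000)   # 40 hypothetical phoneme classes
test_feats = rng.standard_normal((1000, 39))

# 1) Train an MLP in a supervised, discriminative mode with phoneme labels
#    as output classes.
mlp = MLPClassifier(hidden_layer_sizes=(500,), max_iter=20)
mlp.fit(train_feats, train_phones)

# 2) Use the phoneme posteriors as the nonlinear, data-driven transformation
#    of the input features; the log compresses their highly skewed range.
log_post = np.log(mlp.predict_proba(test_feats) + 1e-10)

# 3) Decorrelate (and optionally reduce) with PCA/KLT so the resulting
#    "tandem" features better suit a diagonal-covariance Gaussian HMM back-end.
tandem_feats = PCA(n_components=25).fit_transform(log_post)
```

In a multi-stream setting, the same MLP stage can take several feature streams (e.g. standard plus STAP/PAC features) as input, which is how TANDEM also serves as the adaptive combination tool mentioned above.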
