Robust speech recognition based on spectro-temporal processing

In this thesis, novel spectro-temporal feature extraction techniques are evaluated for enhancing the robustness of automatic speech recognition (ASR) systems in adverse acoustical conditions. Recent physiological and psychoacoustical findings indicate that spectro-temporal processing plays an important role in human speech perception. Therefore, sigma-pi cells and Gabor filter functions are investigated as secondary feature extraction methods that operate on a spectro-temporal representation. The Gabor features in particular are versatile enough to include cepstral features and purely temporal filtering as special cases, while additionally targeting combined spectro-temporal modulations. A data-driven feature selection method is applied to optimize the feature set. For small vocabularies, both types of features are shown to increase the robustness of ASR systems. Sigma-pi cells also allow the speech-to-noise ratio of an input signal to be estimated solely on the basis of low spectro-temporal modulation. The Gabor-based Tandem feature sets increase the performance of the Qualcomm-ICSI-OGI system on the Aurora task when the two streams are concatenated.
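As a rough illustration of what a spectro-temporal Gabor feature can look like, the sketch below builds a complex two-dimensional Gabor function (a Gaussian envelope multiplied by a complex sinusoid) and correlates it with a spectrogram at one center channel. The Gaussian envelope, the three-sigma truncation, the zero padding, and all names (`gabor_filter`, `gabor_feature`, `omega_t`, `omega_f`, `sigma_t`, `sigma_f`) are assumptions made for this example and are not taken from the thesis.

```python
import numpy as np


def gabor_filter(omega_t, omega_f, sigma_t, sigma_f):
    """Complex 2D Gabor function (hypothetical parameterization).

    omega_t, omega_f: modulation frequencies in radians per frame / per channel.
    sigma_t, sigma_f: Gaussian envelope widths in frames / channels.
    Rows index frequency channels, columns index time frames.
    """
    # Truncate the envelope at three standard deviations (an assumption).
    half_t = int(np.ceil(3 * sigma_t))
    half_f = int(np.ceil(3 * sigma_f))
    t = np.arange(-half_t, half_t + 1)
    f = np.arange(-half_f, half_f + 1)
    ff, tt = np.meshgrid(f, t, indexing="ij")
    envelope = np.exp(-0.5 * ((tt / sigma_t) ** 2 + (ff / sigma_f) ** 2))
    carrier = np.exp(1j * (omega_t * tt + omega_f * ff))
    return envelope * carrier


def gabor_feature(spectrogram, filt, center_channel):
    """Real part of the filter response per frame at one center channel."""
    n_frames = spectrogram.shape[1]
    half_f = filt.shape[0] // 2
    half_t = filt.shape[1] // 2
    # Zero-pad so the filter can be centered on every frame and channel.
    padded = np.pad(spectrogram, ((half_f, half_f), (half_t, half_t)))
    out = np.empty(n_frames)
    for frame in range(n_frames):
        patch = padded[center_channel:center_channel + filt.shape[0],
                       frame:frame + filt.shape[1]]
        out[frame] = np.real(np.sum(patch * np.conj(filt)))
    return out
```

Setting `omega_f = 0` yields a filter that responds to temporal modulation only, and setting `omega_t = 0` yields a purely spectral one; nonzero values for both target combined spectro-temporal modulations, which is the case the abstract emphasizes.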
