Robust speech recognition based on spectro-temporal processing

In this thesis, novel spectro-temporal feature extraction techniques are evaluated for enhancing the robustness of automatic speech recognition (ASR) systems in adverse acoustical conditions. Recent physiological and psychoacoustical findings indicate that spectro-temporal processing plays an important role in human speech perception. Therefore, sigma-pi cells and Gabor filter functions are investigated as secondary feature extraction methods based on a spectro-temporal representation. The Gabor features in particular are versatile enough to include cepstral features and purely temporal filtering as special cases, while additionally targeting combined spectro-temporal modulations. A data-driven feature selection method is applied for feature set optimization. For small vocabularies, both types of features are shown to increase the robustness of ASR systems. Sigma-pi cells also allow the speech-to-noise ratio of an input signal to be estimated solely on the basis of low spectro-temporal modulations. The Gabor-based Tandem feature sets increase the performance of the Qualcomm-ICSI-OGI system on the Aurora task when the two streams are concatenated.

Michael Kleinschmidt
