Automatic speech recognition in adverse acoustic conditions

To improve recognition robustness when training and test conditions are mismatched, the application of key ideas from Missing Feature Theory and Robust Statistical Pattern Recognition within an otherwise conventional ASR system was investigated. To this end, both the type of features used to represent the speech signals and the algorithm used to compute the distance between an observed feature vector and a previously trained parametric model were studied. Two types of feature representation were used: one in which spectrally local distortions are smeared over the entire feature vector, and one in which distortions are smeared over only part of the feature vector. In addition, two distance measures were investigated, viz., a conventional distance measure and a robust local distance function in the form of Acoustic Backing-off. The effects on recognition performance were studied for artificially created, band-limited noise and for NOISEX noise added to the speech signals. The results for artificial band-limited noise indicate that a partially smearing feature transform is to be preferred over a fully smearing transform. Also for artificial band-limited noise, a robust local distance function is to be preferred over the conventional distance measure, as long as the distorted feature values are outliers with respect to the feature distribution observed during training. The experiments with NOISEX noise show that the combination of feature type and distance measure that is optimal for artificial band-limited noise can also improve recognition robustness for NOISEX noise, provided that the noise is band-limited.
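The contrast between the two distance measures can be illustrated with a minimal sketch. It is not the implementation used in the study: it assumes a single diagonal-covariance Gaussian per model and assumes that the robust distance takes the form of a negative log mixture of the trained density and a flat floor distribution. The names `gaussian_neg_log`, `backoff_neg_log`, `eps` and `feat_range`, and all numeric values, are hypothetical and chosen only for illustration.

```python
import math

def gaussian_neg_log(x, mean, var):
    """Conventional per-component distance: negative log of a 1-D Gaussian."""
    return 0.5 * math.log(2.0 * math.pi * var) + (x - mean) ** 2 / (2.0 * var)

def backoff_neg_log(x, mean, var, eps=0.05, feat_range=20.0):
    """Robust per-component distance (assumed form, not taken from the source):
    negative log of a mixture of the trained Gaussian and a uniform floor.
    eps and feat_range are illustrative values, not tuned settings."""
    p_model = math.exp(-gaussian_neg_log(x, mean, var))
    p_floor = 1.0 / feat_range
    return -math.log((1.0 - eps) * p_model + eps * p_floor)

def local_distance(obs, means, variances, per_dim):
    """Sum the per-component distances over the feature vector."""
    return sum(per_dim(x, m, v) for x, m, v in zip(obs, means, variances))

# Toy model and observations (hypothetical values).
means = [0.0, 0.0, 0.0, 0.0]
variances = [1.0, 1.0, 1.0, 1.0]
clean = [0.1, -0.2, 0.3, 0.0]
distorted = [0.1, -0.2, 0.3, 12.0]  # one component corrupted by noise

conv_clean = local_distance(clean, means, variances, gaussian_neg_log)
conv_dist = local_distance(distorted, means, variances, gaussian_neg_log)
rob_clean = local_distance(clean, means, variances, backoff_neg_log)
rob_dist = local_distance(distorted, means, variances, backoff_neg_log)

print("conventional penalty for the outlier:", conv_dist - conv_clean)
print("robust penalty for the outlier:      ", rob_dist - rob_clean)
```

Under these assumptions the conventional distance grows quadratically with the size of the distortion, whereas the robust distance saturates at `-log(eps / feat_range)` per component. This is consistent with the observation that the robust distance helps only when distorted values are outliers: a distorted value that still falls inside the trained distribution is scored almost identically by both measures. It also suggests why a partially smearing transform matters, since a single corrupted component can be bounded, while a distortion spread over every component cannot.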
