Time and frequency filtering of filter-bank energies for robust HMM speech recognition

Abstract: Every speech recognition system requires a signal representation that parametrically models the temporal evolution of the speech spectral envelope. Current parameterizations involve, either explicitly or implicitly, a set of energies from frequency bands that are often distributed on a mel scale. These energies are computed in diverse ways, but the computation always includes smoothing of basic spectral measurements and non-linear amplitude compression. Several linear transformations are then applied to the two-dimensional time-frequency sequence of energies before it enters the HMM pattern matching stage. This paper presents a recently introduced technique that filters that sequence of energies along the frequency dimension, and compares the resulting parameters with the widely used cepstral coefficients. That frequency filtering transformation is then considered jointly with the time filtering transformation used to compute dynamic parameters, showing that the flexibility of this combined time and frequency filtering (tiffing) approach can be used to design a robust set of filters. Results of recognition experiments are reported that show the potential of tiffing for enhanced and more robust HMM speech recognition.
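
As a rough illustration of the two filtering steps named in the abstract, the following Python sketch filters a matrix of log filter-bank energies first along the frequency axis and then along the time axis. The abstract does not specify the filters, so everything concrete here is an assumption: the frequency filter taps (1, 0, -1), which correspond to z - z^{-1} with zero padding at the band edges, the regression-based delta filter with replicated edge frames, and the function names `frequency_filter`, `time_filter`, and `tiffing`, which are hypothetical.

```python
import numpy as np


def frequency_filter(log_energies, taps=(1.0, 0.0, -1.0)):
    """Filter each frame along the band (frequency) axis.

    log_energies has shape (frames, bands). The default taps are an
    assumed choice giving y[b] = x[b+1] - x[b-1], with zeros assumed
    beyond the first and last band.
    """
    x = np.asarray(log_energies, dtype=float)
    h = np.asarray(taps, dtype=float)
    return np.apply_along_axis(
        lambda frame: np.convolve(frame, h, mode="same"), 1, x
    )


def time_filter(features, half_width=2):
    """Filter each parameter trajectory along the time axis.

    Uses the common regression (delta) filter as an assumed example:
    d[t] = sum_k k * (x[t+k] - x[t-k]) / (2 * sum_k k^2), k = 1..half_width.
    Edge frames are replicated so the output keeps the input length.
    """
    x = np.asarray(features, dtype=float)
    n_frames = x.shape[0]
    padded = np.pad(x, ((half_width, half_width), (0, 0)), mode="edge")
    numerator = np.zeros_like(x)
    for k in range(1, half_width + 1):
        ahead = padded[half_width + k: half_width + k + n_frames]
        behind = padded[half_width - k: half_width - k + n_frames]
        numerator += k * (ahead - behind)
    denominator = 2.0 * sum(k * k for k in range(1, half_width + 1))
    return numerator / denominator


def tiffing(log_energies, half_width=2):
    """Return static (frequency-filtered) and dynamic (also time-filtered)
    parameters stacked side by side, shape (frames, 2 * bands)."""
    static = frequency_filter(log_energies)
    dynamic = time_filter(static, half_width=half_width)
    return np.hstack([static, dynamic])


# Toy input: 5 frames, 4 bands, energies growing linearly in time.
toy = 2.0 * np.arange(5)[:, None] + np.arange(4)[None, :] ** 2
toy_features = tiffing(toy)
```

In this sketch the static parameters would play the role that cepstral coefficients play in a conventional front end, and the dynamic parameters the role of delta features. A different pair of filters could be substituted by changing `taps` or the body of `time_filter`.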
