Phoneme recognition using spectral envelope and modulation frequency features

We present a new feature extraction technique for phoneme recognition that uses short-term spectral envelope and modulation frequency features. Both feature sets are derived from sub-band temporal envelopes of speech estimated using Frequency Domain Linear Prediction (FDLP). The spectral envelope features are obtained by short-term integration of the sub-band envelopes, whereas the modulation frequency components are derived from the long-term evolution of the same envelopes. The two feature streams are combined at the phoneme posterior level and used as input to a hybrid HMM-ANN phoneme recognizer. For the phoneme recognition task on the TIMIT database, the proposed features show an improvement of 4.7% over other feature extraction techniques.
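As a rough illustration of the envelope estimation step, the sketch below applies linear prediction to the DCT of a signal, so that the resulting all-pole model approximates the temporal envelope rather than the spectrum, and then integrates the sub-band envelopes over short frames. It is a minimal sketch under stated assumptions, not the configuration used in this work: the function names, the uniform rectangular sub-bands, the model order, the regularisation constant, and the frame and hop lengths are all hypothetical choices made for illustration. The modulation frequency features are not shown.

```python
import numpy as np
from scipy.fft import dct
from scipy.linalg import solve_toeplitz


def fdlp_envelope(dct_coeffs, order=20, n_points=500):
    """All-pole estimate of a temporal envelope from a block of DCT coefficients."""
    c = np.asarray(dct_coeffs, dtype=float)
    # Autocorrelation of the DCT sequence at lags 0..order.
    r = np.array([np.dot(c[: len(c) - k], c[k:]) for k in range(order + 1)])
    r[0] *= 1.0 + 1e-6  # light diagonal loading (illustrative assumption)
    # Solve the Yule-Walker (Toeplitz) equations for the predictor coefficients.
    a = solve_toeplitz(r[:order], r[1 : order + 1])
    gain = r[0] - np.dot(a, r[1 : order + 1])
    # Evaluate gain / |A|^2 on the upper half of the unit circle. Because the
    # prediction ran along frequency, this axis corresponds to time.
    A = np.fft.rfft(np.concatenate(([1.0], -a)), 2 * n_points)[:n_points]
    return gain / np.abs(A) ** 2


def subband_envelopes(x, n_bands=8, order=20, n_points=500):
    """One envelope per sub-band; bands are uniform slices of the DCT (assumption)."""
    c = dct(np.asarray(x, dtype=float), type=2, norm="ortho")
    bands = np.array_split(c, n_bands)
    return np.stack([fdlp_envelope(b, order, n_points) for b in bands])


def short_term_integrate(envelopes, frame_len=25, hop=10):
    """Sum each sub-band envelope over short overlapping frames."""
    n_points = envelopes.shape[1]
    starts = range(0, n_points - frame_len + 1, hop)
    return np.stack(
        [envelopes[:, s : s + frame_len].sum(axis=1) for s in starts], axis=1
    )


# Usage: a noise burst centred at 70% of the segment, on a weak noise floor.
rng = np.random.default_rng(0)
n = 4000
t = np.arange(n) / n
x = (0.05 + np.exp(-0.5 * ((t - 0.7) / 0.02) ** 2)) * rng.standard_normal(n)

env = subband_envelopes(x)          # shape: (n_bands, n_points)
feats = short_term_integrate(env)   # shape: (n_bands, n_frames)
```

In this toy example each row of `env` should rise around the location of the burst, and each column of `feats` is a short-term, spectral-envelope-like vector across the sub-bands.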
