Augmenting standard speech recognition features with energy gravity centres

This paper investigates the possibility of adding new features to the classical Mel-scaled cepstral coefficients (MFCCs) and their time derivatives. A hybrid Automatic Speech Recognition (ASR) system is used, based on a Neural Network (NN) and a collection of Hidden Markov Models (HMMs). It is shown that the gravity centres (GCs) of energies in the frequency bands of the first three formants, together with their first and second time derivatives, can be added to the classical set of MFCCs and their first and second time derivatives, resulting in significant performance improvements. Nevertheless, in some cases the added parameters may have a negative effect on performance: they are reliable only for certain types of sounds, and their values may exhibit large variations for the same sound in the presence of additive noise. Experiments have shown that one solution is to introduce a reliability index indicating the importance the newly added parameters should have in describing a given frame. NNs appear to be suitable devices for taking this index into account in the computation of observation probabilities. Experiments have also shown improvements when GCs are computed from zero-crossing intervals detected at the output of the filters of an ear model. In that case, intensities are obtained by associating a nonlinear peak amplitude coding with each zero-crossing interval. Consistent improvements are observed when the above-mentioned solutions are applied with medium-size as well as large-size lexicons in the presence of additive noise.
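
As an illustration of the gravity-centre feature described above, the following is a minimal sketch that computes the energy-weighted mean frequency of a frame inside three frequency bands. It is not the authors' implementation: the band edges in `FORMANT_BANDS`, the naive DFT in `power_spectrum`, and all function names are assumptions made for this example, and the zero-crossing and ear-model variant is not covered.

```python
import math

# Hypothetical band edges in Hz for the first three formants.
# These values are assumptions for illustration, not taken from the paper.
FORMANT_BANDS = [(200.0, 1000.0), (800.0, 2500.0), (2000.0, 3500.0)]


def power_spectrum(frame, sample_rate):
    """Naive DFT power spectrum of one frame, from 0 Hz up to Nyquist."""
    n = len(frame)
    freqs, energies = [], []
    for k in range(n // 2 + 1):
        re = sum(x * math.cos(2 * math.pi * k * t / n) for t, x in enumerate(frame))
        im = -sum(x * math.sin(2 * math.pi * k * t / n) for t, x in enumerate(frame))
        freqs.append(k * sample_rate / n)
        energies.append(re * re + im * im)
    return freqs, energies


def gravity_centre(freqs, energies, low, high):
    """Energy-weighted mean frequency in [low, high].

    Returns None when the band contains no energy, so the caller can
    treat the feature as unavailable for that frame.
    """
    total = 0.0
    weighted = 0.0
    for f, e in zip(freqs, energies):
        if low <= f <= high:
            total += e
            weighted += f * e
    if total <= 0.0:
        return None
    return weighted / total


def gravity_centres(frame, sample_rate, bands=FORMANT_BANDS):
    """One gravity centre per band for a single frame of samples."""
    freqs, energies = power_spectrum(frame, sample_rate)
    return [gravity_centre(freqs, energies, lo, hi) for lo, hi in bands]
```

Under these assumptions, a frame containing a pure 500 Hz tone yields a first-band gravity centre close to 500 Hz. First and second time derivatives could then be approximated by differencing the gravity centres of neighbouring frames.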
