Auditory Features Revisited for Robust Speech Recognition

Auditory based front-ends for speech recognition have been compared before, but this paper focuses on two of the most promising algorithms for noise robustness in automatic speech recognition (ASR). The feature sets are Zero-Crossings with Peak Amplitudes (ZCPA) and the recently introduced Power-Law Nonlinearity and Power-Bias Subtraction (PNCC). Standard Mel-Frequency Cepstral Coefficients (MFCC) are also tested for reference. The performance of all features is reported on the TIMIT database using a HMM-based recogniser. It is found that the PNCC features outperform MFCC in clean conditions and are most robust to noise. ZCPA performance is shown to vary widely with filter bank configuration and frame length. The ZCPA performance is poor in clean conditions but is the least affected by white noise. PNCC is shown to be the most promising new feature set for robust ASR in recent years.

[1]  James R. Glass,et al.  Heterogeneous acoustic measurements for phonetic classification 1 , 1997, EUROSPEECH.

[2]  Jonathan G. Fiscus,et al.  Darpa Timit Acoustic-Phonetic Continuous Speech Corpus CD-ROM {TIMIT} | NIST , 1993 .

[3]  Richard M. Stern,et al.  Feature extraction for robust speech recognition using a power-law nonlinearity and power-bias subtraction , 2009, INTERSPEECH.

[4]  Kuldip K. Paliwal,et al.  Robust speech recognition using features based on zero crossings with peak amplitudes , 2003, 2003 IEEE International Conference on Acoustics, Speech, and Signal Processing, 2003. Proceedings. (ICASSP '03)..

[5]  Waleed H. Abdulla,et al.  Performance evaluation of front-end algorithms for robust speech recognition , 2005, Proceedings of the Eighth International Symposium on Signal Processing and Its Applications, 2005..

[6]  Oded Ghitza Auditory models and human performance in tasks related to speech coding and speech recognition , 1994 .

[7]  Carla Teixeira Lopes,et al.  TIMIT Acoustic-Phonetic Continuous Speech Corpus , 2012 .

[8]  L. Carney,et al.  A phenomenological model for the responses of auditory-nerve fibers: I. Nonlinear tuning with compression and suppression. , 2001, The Journal of the Acoustical Society of America.

[9]  S. Seneff A joint synchrony/mean-rate model of auditory speech processing , 1990 .

[10]  James R. Glass,et al.  HETEROGENEOUS ACOUSTIC MEASUREMENTS FOR PHONETIC CLASSIFICATION , 1997 .

[11]  Rhee Man Kil,et al.  Auditory processing of speech signals for robust speech recognition in real-world noisy environments , 1999, IEEE Trans. Speech Audio Process..