Cepstral normalisation and the signal to noise ratio spectrum in automatic speech recognition

Cepstral normalisation in automatic speech recognition is investigated in the context of robustness to additive noise. It is argued that such normalisation leads naturally to a speech feature based on signal-to-noise ratio (SNR) rather than absolute energy (or power). Explicit calculation of this SNR-cepstrum by means of a noise estimate is shown to have theoretical and practical advantages over the usual (energy-based) cepstrum. The relationship between the SNR-cepstrum and the articulation index, known from psychoacoustics, is discussed. Experiments are presented suggesting that combining the SNR-cepstrum with the well-known perceptual linear prediction (PLP) method can be beneficial in noisy environments.
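To make the idea of an explicitly calculated SNR-cepstrum concrete, the following is a minimal sketch and not the paper's implementation. It assumes a per-bin noise power estimate is already available, uses a crude floored SNR estimate max(x/n - 1, 0), takes log(1 + SNR), and applies a plain DCT-II. The function names `dct_ii` and `snr_cepstrum`, the flooring rule, and the number of coefficients are all hypothetical choices made for illustration.

```python
import numpy as np


def dct_ii(x, n_ceps):
    """Unnormalised DCT-II along the last axis, keeping the first n_ceps terms."""
    n = x.shape[-1]
    k = np.arange(n_ceps)[:, None]   # output (cepstral) index
    m = np.arange(n)[None, :]        # input (spectral bin) index
    basis = np.cos(np.pi * k * (2 * m + 1) / (2 * n))  # shape (n_ceps, n)
    return x @ basis.T


def snr_cepstrum(power_spec, noise_psd, n_ceps=13):
    """Illustrative SNR-cepstrum (assumed form, not the author's exact method).

    power_spec: array of shape (frames, bins), noisy power spectrum
    noise_psd:  array of shape (bins,), noise power estimate (assumed given)
    """
    # Assumed SNR estimate: observed-to-noise power ratio minus one, floored at zero.
    snr = np.maximum(power_spec / noise_psd - 1.0, 0.0)
    # log(1 + SNR) is zero wherever the observation is at or below the noise level.
    return dct_ii(np.log1p(snr), n_ceps)
```

Under these assumptions the feature depends only on the ratio of observed power to noise power, so a gain applied to both leaves it unchanged, and a frame that equals the noise estimate maps to an all-zero cepstrum.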
