Feature normalization based on non-extensive statistics for speech recognition

Highlights:
- We propose a feature normalization method for robust speech recognition.
- It operates in a spectral domain intermediate between log and linear.
- We name our method q-logarithmic Spectral Mean Normalization (q-LSMN).
- It is based on non-extensive statistics, in which additivity does not hold.
- It outperformed CMN, MVN, and ETSI AFE in our experiments.

Most compensation methods for improving the robustness of speech recognition systems in noisy environments, such as spectral subtraction, CMN, and MVN, rely on the assumption that noise and speech spectra are independent. However, the use of a limited analysis window in signal processing may introduce a cross-term between them, which degrades speech recognition accuracy. To tackle this problem, we introduce the q-logarithmic (q-log) spectral domain of non-extensive statistics and propose q-log spectral mean normalization (q-LSMN), an extension of log spectral mean normalization (LSMN) to this domain. Recognition experiments on a synthesized noisy speech database, the Aurora-2 database, showed that q-LSMN was consistently better than the conventional normalization methods CMN, LSMN, and MVN. Furthermore, q-LSMN was even more effective when applied to a real noisy environment in the CENSREC-2 database, where it significantly outperformed the ETSI AFE front-end.
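
To make the idea of a domain "intermediate between log and linear" concrete, the following is a minimal sketch, not the paper's implementation. It assumes the standard Tsallis q-logarithm, ln_q(x) = (x^(1-q) - 1) / (1 - q), which reduces to the natural log as q approaches 1 and to the linear map x - 1 at q = 0. It also assumes that normalization means subtracting the per-bin mean over time in that domain. The function names `q_log` and `q_lsmn`, the input layout, and the choice of q are illustrative assumptions; the abstract does not specify them.

```python
import math

def q_log(x, q):
    """Tsallis q-logarithm (assumed form): (x**(1-q) - 1) / (1-q).

    Falls back to the natural log when q is numerically equal to 1.
    """
    if x <= 0:
        raise ValueError("x must be positive")
    if abs(q - 1.0) < 1e-12:
        return math.log(x)
    return (x ** (1.0 - q) - 1.0) / (1.0 - q)

def q_lsmn(frames, q):
    """Hypothetical q-log spectral mean normalization.

    frames: list of frames, each a list of positive spectral values
            (one value per frequency bin).
    Returns the q-log spectra with the per-bin mean over time removed.
    """
    mapped = [[q_log(v, q) for v in frame] for frame in frames]
    n_frames = len(mapped)
    n_bins = len(mapped[0])
    # Mean of each frequency bin across all frames.
    means = [sum(f[b] for f in mapped) / n_frames for b in range(n_bins)]
    return [[f[b] - means[b] for b in range(n_bins)] for f in mapped]

# Usage: three frames, two frequency bins, q chosen arbitrarily.
spectra = [[1.0, 4.0], [2.0, 9.0], [3.0, 16.0]]
normalized = q_lsmn(spectra, q=0.5)
```

With q = 1 this sketch coincides with ordinary log spectral mean normalization, which is the sense in which q-LSMN is described as an extension of LSMN.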
