Robust Multifactor Speech Feature Extraction Based on Gabor Analysis

The performance of a speech recognition system depends on how consistent its speech features remain between the training and testing stages, and on how well they adapt to complex acoustic conditions. Traditional systems usually perform poorly under adverse noisy conditions, which limits their applicability to most real-world problems. In this paper, we investigate speech feature extraction in noisy environments and propose a novel approach based on Gabor filtering and tensor factorization. Recent physiological and psychoacoustic experimental results suggest that localized spectro-temporal features are essential for auditory perception. To exploit this property, we represent the speech signal as a general higher-order tensor and use two-dimensional Gabor functions with different scales and directions to analyze localized patches of the power spectrogram. We then propose Nonnegative Tensor PCA with sparse constraints to learn the projection matrices from multiple interrelated feature subspaces. The sparse constraints are intended to preserve the statistical characteristics of clean speech by finding projection matrices of the speech subspaces, and to reduce noise components whose distributions differ from those of clean speech. We further propose a multifactor analysis method that extracts robust sparse features by processing the data samples in their tensor structure. Simulation results indicate that the proposed method improves speech recognition performance, especially in noisy environments, compared with traditional speech feature extraction methods.
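The Gabor analysis step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the kernel size, the wavelengths standing in for "scales", the angles standing in for "directions", the choice of sigma, and the toy spectrogram are all assumptions made for illustration, and the function names are hypothetical. The sketch filters a small power spectrogram with a bank of real-valued 2-D Gabor kernels and stacks the responses into a higher-order array indexed by scale, direction, frequency, and time.

```python
import math


def gabor_kernel(size, wavelength, theta, sigma):
    """Real part of a 2-D Gabor function on a size x size grid (size assumed odd)."""
    half = size // 2
    kernel = []
    for y in range(-half, half + 1):
        row = []
        for x in range(-half, half + 1):
            # Rotate coordinates so the cosine carrier runs along direction theta.
            xr = x * math.cos(theta) + y * math.sin(theta)
            envelope = math.exp(-(x * x + y * y) / (2.0 * sigma ** 2))
            row.append(envelope * math.cos(2.0 * math.pi * xr / wavelength))
        kernel.append(row)
    return kernel


def filter_spectrogram(spec, kernel):
    """Same-size 2-D correlation of spec with kernel, using zero padding."""
    n_freq, n_time = len(spec), len(spec[0])
    half = len(kernel) // 2
    out = [[0.0] * n_time for _ in range(n_freq)]
    for f in range(n_freq):
        for t in range(n_time):
            acc = 0.0
            for i, row in enumerate(kernel):
                ff = f + i - half
                if not 0 <= ff < n_freq:
                    continue  # outside the spectrogram: treated as zero
                for j, weight in enumerate(row):
                    tt = t + j - half
                    if 0 <= tt < n_time:
                        acc += weight * spec[ff][tt]
            out[f][t] = acc
    return out


def gabor_tensor(spec, wavelengths, thetas, size=5):
    """Responses as nested lists indexed [scale][direction][frequency][time]."""
    return [
        # Assumption: envelope width tied to the wavelength (sigma = wavelength / 2).
        [filter_spectrogram(spec, gabor_kernel(size, wl, th, 0.5 * wl))
         for th in thetas]
        for wl in wavelengths
    ]


# Toy 6 x 8 "power spectrogram" (frequency x time); values are made up.
toy_spec = [[float((f * t) % 5) for t in range(8)] for f in range(6)]
wavelengths = [2.0, 4.0]                           # assumed scales
thetas = [k * math.pi / 4.0 for k in range(4)]     # assumed directions
tensor = gabor_tensor(toy_spec, wavelengths, thetas)
print(len(tensor), len(tensor[0]), len(tensor[0][0]), len(tensor[0][0][0]))
```

In this sketch each (scale, direction) pair contributes one filtered copy of the spectrogram, so the scale and direction axes become extra modes of the data array. A tensor factorization such as the Nonnegative Tensor PCA named in the abstract would then operate on an array of this kind; that step is not shown here because the abstract does not specify it.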
