Robust speaker identification using an auditory-based feature

An auditory-based feature extraction algorithm is presented. The feature is built on a recently published time-frequency transform combined with a set of modules that simulate the signal-processing functions of the cochlea. It is applied to a speaker identification task to address the acoustic mismatch between training and testing conditions: acoustic models trained on clean speech typically degrade sharply when tested on noisy speech. The proposed feature shows strong robustness under this mismatch. In our speaker identification experiments, both MFCC and the proposed feature achieve near-perfect accuracy under clean test conditions, but when the SNR of the input signal drops to 6 dB, the average accuracy with MFCC falls to 41.2%, while the proposed feature still achieves an average accuracy of 88.3%.
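
The abstract does not describe the front end in detail, so the following is only a minimal, hypothetical sketch of what a cochlea-inspired feature extractor of this general kind can look like, contrasted with the log/MFCC pipeline: an ERB-spaced band-pass filterbank, half-wave rectification with low-pass smoothing as a crude hair-cell model, cube-root amplitude compression (Stevens' power law), and a DCT for decorrelation. All function names, filter orders, and parameter values below are illustrative assumptions, not the paper's algorithm.

```python
"""Hypothetical sketch of an auditory-inspired feature extractor (not the paper's method)."""
import numpy as np
from scipy.signal import butter, lfilter
from scipy.fft import dct


def erb_center_freqs(fmin, fmax, n_channels):
    """Center frequencies equally spaced on the ERB-rate scale."""
    erb_rate = lambda f: 21.4 * np.log10(4.37 * f / 1000.0 + 1.0)
    inv_erb_rate = lambda e: (10.0 ** (e / 21.4) - 1.0) * 1000.0 / 4.37
    return inv_erb_rate(np.linspace(erb_rate(fmin), erb_rate(fmax), n_channels))


def auditory_features(x, fs=16000, n_channels=32, n_ceps=13,
                      frame_len=400, hop=160):
    """Frame-level auditory-style cepstral features (illustrative only)."""
    feats = []
    for cf in erb_center_freqs(100.0, 0.45 * fs, n_channels):
        erb = 24.7 * (4.37 * cf / 1000.0 + 1.0)            # ERB bandwidth in Hz
        lo = max(cf - 0.5 * erb, 10.0)
        hi = min(cf + 0.5 * erb, 0.49 * fs)
        b, a = butter(2, [lo, hi], btype="band", fs=fs)    # one cochlear channel
        y = lfilter(b, a, x)
        y = np.maximum(y, 0.0)                             # half-wave rectification
        b_lp, a_lp = butter(2, 800.0, btype="low", fs=fs)  # hair-cell smoothing
        env = lfilter(b_lp, a_lp, y)
        # Average envelope energy over 25 ms frames with a 10 ms hop.
        n_frames = 1 + (len(env) - frame_len) // hop
        frames = np.stack([env[i * hop:i * hop + frame_len] for i in range(n_frames)])
        feats.append(frames.mean(axis=1))
    e = np.stack(feats, axis=1)                            # (frames, channels)
    e = np.cbrt(np.maximum(e, 1e-10))                      # cube-root compression
    return dct(e, type=2, norm="ortho", axis=1)[:, :n_ceps]


if __name__ == "__main__":
    fs = 16000
    t = np.arange(fs) / fs
    clean = np.sin(2 * np.pi * 440 * t)
    noisy = clean + 0.35 * np.random.randn(len(t))         # roughly 6 dB SNR
    f_clean = auditory_features(clean, fs)
    f_noisy = auditory_features(noisy, fs)
    print(f_clean.shape, np.abs(f_clean - f_noisy).mean())
```

The intent of such a front end is that the compressive nonlinearity and narrow auditory channels limit how much additive noise perturbs the feature vectors, which is one plausible explanation for the robustness gap over MFCC reported in the abstract.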
