Spectro-temporal features for robust far-field speaker identification

Features derived from an auditory spectro-temporal representation of speech are proposed for robust far-field speaker identification. The auditory representation is obtained by first filtering the speech signal with a gammatone filterbank. A modulation filterbank is then applied to the temporal envelope of each gammatone filter output. Compared to commonly used mel-frequency cepstral coefficients (MFCC), the proposed features are shown to be more robust to mismatched conditions between enrollment and test data and are less sensitive to increasing reverberation time (RT ). Experiments with simulated and recorded far-field speech show that a Gaussian mixture model based identification system, trained on the proposed features, attains an average improvement in identification accuracy of 15% relative to a system trained onMFCC. Improvements of up to 85% are attained for larger RT .

[1]  R. Plomp,et al.  Effect of reducing slow temporal modulations on speech reception. , 1994, The Journal of the Acoustical Society of America.

[2]  Andreas Stolcke,et al.  The ICSI Meeting Corpus , 2003, 2003 IEEE International Conference on Acoustics, Speech, and Signal Processing, 2003. Proceedings. (ICASSP '03)..

[3]  Lukás Burget,et al.  Analysis of Feature Extraction and Channel Compensation in a GMM Speaker Recognition System , 2007, IEEE Transactions on Audio, Speech, and Language Processing.

[4]  R.A. Goubran,et al.  Combating Reverberation in Speaker Verification , 2005, 2005 IEEE Instrumentationand Measurement Technology Conference Proceedings.

[5]  R. Plomp,et al.  Effect of temporal envelope smearing on speech reception. , 1994, The Journal of the Acoustical Society of America.

[6]  Zachary M. Smith,et al.  Chimaeric sounds reveal dichotomies in auditory perception , 2002, Nature.

[7]  Hua Yuan,et al.  Spectro-temporal processing for blind estimation of reverberation time and single-ended quality measurement of reverberant speech , 2007, INTERSPEECH.

[8]  Javier Ortega-Garcia,et al.  Increasing robustness in GMM speaker recognition systems for noisy and reverberant speech with low complexity microphone arrays , 1996, Proceeding of Fourth International Conference on Spoken Language Processing. ICSLP '96.

[9]  Misha Pavel,et al.  Intelligibility of speech with filtered time trajectories of spectral envelopes , 1996, Proceeding of Fourth International Conference on Spoken Language Processing. ICSLP '96.

[10]  Douglas A. Reynolds,et al.  Speaker identification and verification using Gaussian mixture speaker models , 1995, Speech Commun..

[11]  Hans-Günter Hirsch,et al.  The simulation of realistic acoustic input scenarios for speech recognition systems , 2005, INTERSPEECH.

[12]  Marc Moonen,et al.  Multimicrophone Speech Dereverberation: Experimental Validation , 2007, EURASIP J. Audio Speech Music. Process..

[13]  Tanja Schultz,et al.  Far-Field Speaker Recognition , 2006, IEEE Transactions on Audio, Speech, and Language Processing.

[14]  Douglas L. Jones,et al.  Blind estimation of reverberation time. , 2003, The Journal of the Acoustical Society of America.