Robust speaker identification using a CASA front-end

Speaker recognition remains a challenging task under noisy conditions. Inspired by auditory perception, computational auditory scene analysis (CASA) typically segregates speech from interference by producing a binary time-frequency mask. We first show that a recently introduced speaker feature, the gammatone frequency cepstral coefficient (GFCC), performs substantially better than conventional speaker features under noisy conditions. To deal with noisy speech, we apply CASA separation and then either reconstruct or marginalize the corrupted components indicated by the CASA mask. Both methods are effective. We further combine them into a single system that selects between the two based on the detected signal-to-noise ratio (SNR). This system achieves significant performance improvements over related systems across a wide range of SNR conditions.
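The sketch below illustrates, in Python, the general shape of the mask-driven front-end described above: given a cochleagram-like feature matrix and a binary time-frequency mask, unreliable components are either reconstructed from reliable ones or marginalized, with the choice made from an SNR estimate. This is a minimal illustration under stated assumptions, not the authors' implementation; the SNR estimator, the switching threshold, and the per-channel mean-fill reconstruction rule are all hypothetical placeholders.

```python
import numpy as np

# Hypothetical switching point between the two strategies (not from the paper).
SNR_SWITCH_DB = 0.0


def estimate_snr_db(features, mask):
    """Crude SNR estimate: energy ratio of mask-reliable to unreliable cells."""
    speech = np.sum(features[mask == 1]) + 1e-10
    noise = np.sum(features[mask == 0]) + 1e-10
    return 10.0 * np.log10(speech / noise)


def reconstruct(features, mask):
    """Fill unreliable cells with the per-channel mean of reliable cells
    (an illustrative stand-in for a proper missing-feature reconstruction)."""
    out = features.copy()
    for ch in range(features.shape[0]):
        reliable = features[ch, mask[ch] == 1]
        fill = reliable.mean() if reliable.size else 0.0
        out[ch, mask[ch] == 0] = fill
    return out


def marginalize(features, mask):
    """Pass features and mask through so a missing-data scorer can skip
    unreliable cells during likelihood computation."""
    return features, mask


def casa_frontend(features, mask):
    """Choose reconstruction or marginalization from the estimated SNR."""
    if estimate_snr_db(features, mask) >= SNR_SWITCH_DB:
        return "reconstruction", reconstruct(features, mask)
    return "marginalization", marginalize(features, mask)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats = rng.random((64, 100))                      # 64 gammatone channels x 100 frames
    tf_mask = (rng.random((64, 100)) > 0.4).astype(int)  # 1 = speech-dominant cell
    strategy, _ = casa_frontend(feats, tf_mask)
    print("Selected strategy:", strategy)
```

In practice the mask would come from a CASA segregation stage and the downstream speaker models would score either the reconstructed features or the feature-mask pair; here random data simply exercises the switching logic.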
