GMM/SVM N-best speaker identification under mismatch channel conditions

Under severe channel mismatch, such as training on far-field speech and testing on telephone data, speaker identification (SID) performance degrades significantly, often below the level needed for practical use. For many SID tasks, however, it is sufficient to return an N-best list of candidate speakers for further human analysis. We investigate N-best SID accuracy for matched (telephone/telephone) and mismatched (far-field/telephone) train/test channel conditions. Using an SVM-GMM supervector (GSV) system, pitch and formant frequency histograms (PFH), and cross-channel adaptation using cohorts, we reduced the matched-channel top-1 error rate by more than 25% relative to the GMM-UBM baseline and achieved mismatched N-best accuracy comparable to the baseline.
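To make the N-best criterion concrete, the following minimal sketch counts a test utterance as correct when its true speaker appears anywhere among the N highest-scoring candidates. The function name `n_best_accuracy`, the speaker labels, and the toy scores are illustrative assumptions, not taken from the paper; the scores stand in for whatever a GMM-UBM or GSV system would output.

```python
def n_best_accuracy(scores, true_speakers, n):
    """Fraction of utterances whose true speaker is among the n top-scoring candidates.

    scores: one dict per test utterance, mapping speaker label -> score
            (higher means more likely); a hypothetical stand-in for system output.
    true_speakers: the correct speaker label for each utterance, in the same order.
    """
    hits = 0
    for utt_scores, truth in zip(scores, true_speakers):
        # Rank candidate speakers by descending score and keep the top n.
        ranked = sorted(utt_scores, key=utt_scores.get, reverse=True)
        if truth in ranked[:n]:
            hits += 1
    return hits / len(true_speakers)


# Toy data (hypothetical): three test utterances scored against three enrolled speakers.
scores = [
    {"spk_a": 2.1, "spk_b": 1.3, "spk_c": -0.4},
    {"spk_a": 0.2, "spk_b": 0.9, "spk_c": 0.5},
    {"spk_a": 1.0, "spk_b": 0.1, "spk_c": 0.8},
]
truth = ["spk_a", "spk_c", "spk_b"]

for n in (1, 2, 3):
    print(n, n_best_accuracy(scores, truth, n))
```

With N = 1 this reduces to ordinary top-1 identification accuracy; increasing N can only keep the accuracy the same or raise it, which is why an N-best list may remain useful for human review even when top-1 accuracy is poor.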
