On the use of GSV-SVM for Speaker Diarization and Tracking

In this paper, we present the use of Gaussian Supervectors with Support Vector Machine classifiers (GSV-SVM) in an acoustic speaker diarization system and a speaker tracking system, and compare them with a standard Gaussian Mixture Model system based on adapted Universal Background Models (GMM-UBM). The GSV-SVM systems, which share the adaptation step with the GMM-UBM systems, are observed to have comparable performance: for acoustic speaker diarization, the GMM-UBM system outperforms the GSV-SVM system on ESTER2 data, whereas the GSV-SVM system performs better on the speaker tracking task. In particular, a linear combination of the two systems at the score level outperforms each individual system.
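As an illustration of what a linear combination at the score level can look like, the following minimal sketch fuses per-trial scores from the two systems with a single interpolation weight. The function name `fuse_scores`, the `weight` parameter and its default value are assumptions made for this example and are not taken from the paper, which does not state the weights or any score normalization in this abstract.

```python
def fuse_scores(gmm_ubm_scores, gsv_svm_scores, weight=0.5):
    """Linearly combine two lists of per-trial scores.

    `weight` is a hypothetical interpolation weight applied to the
    GMM-UBM scores; (1 - weight) is applied to the GSV-SVM scores.
    """
    if len(gmm_ubm_scores) != len(gsv_svm_scores):
        raise ValueError("both systems must score the same trials")
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    return [
        weight * a + (1.0 - weight) * b
        for a, b in zip(gmm_ubm_scores, gsv_svm_scores)
    ]


# Example with made-up scores for two trials.
fused = fuse_scores([1.0, -1.0], [3.0, 1.0], weight=0.25)
```

In such a scheme the weight would typically be tuned on held-out development data, and the two score streams would need to be on comparable scales before being combined.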
