First investigations on self trained speaker diarization

This paper investigates self-trained cross-show speaker diarization applied to collections of French TV archives, based on an i-vector/PLDA framework. The parameters used for i-vector extraction and PLDA scoring are trained in an unsupervised way, using the data of the collection itself. Performance is compared across combinations of target data and external data for training. The experimental results on two distinct target corpora show that using data from the corpora themselves to perform unsupervised iterative training and domain adaptation of PLDA parameters can improve an existing system trained on external annotated data. Such results indicate that speaker indexing of small collections of unlabeled audio archives need only rely on the availability of a sufficient external corpus, which can then be adapted specifically to each target collection. We show that a minimum collection size is required to do without such an external bootstrap.
