Hierarchical speaker clustering methods for the NIST i-vector Challenge

The process of manually labeling data is very expensive and sometimes infeasible due to privacy and security issues. This paper investigates the use of two algorithms for clustering unlabeled training i-vectors. This aims at improving speaker recognition performance by using state-of-the-art supervised techniques in the context of the NIST i-vector Machine Learning Challenge 2014. The first algorithm is the well-known Ward clustering that aims at optimizing an objective function across all clusters. The second one is a cascade clustering, which benefits from the latest advances in speaker modeling and session compensation techniques, and relies on both the cosine similarity and probabilistic linear discriminant analysis (PLDA). Furthermore, this paper investigates the multi-clustering fusion that opens the door for further improvements. The experimental results show that the use of the automatically labeled i-vectors to train supervised methods such as LDA, PLDA or linear logistic regression-based fusion, decreases the minimum decision cost function by up to 22%.

[1]  James R. Glass,et al.  On the Use of Spectral and Iterative Methods for Speaker Diarization , 2012, INTERSPEECH.

[2]  Hsin-Min Wang,et al.  Clustering speech utterances by speaker using Eigenvoice-motivated vector space models , 2005, Proceedings. (ICASSP '05). IEEE International Conference on Acoustics, Speech, and Signal Processing, 2005..

[3]  Sébastien Marcel,et al.  Spear: An open source toolbox for speaker recognition based on Bob , 2014, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[4]  Sébastien Marcel,et al.  Bob: a free signal processing and machine learning toolbox for researchers , 2012, ACM Multimedia.

[5]  Tomi Kinnunen,et al.  Effect of multicondition training on i-vector PLDA configurations for speaker recognition , 2013, INTERSPEECH.

[6]  Patrick Kenny,et al.  Front-End Factor Analysis for Speaker Verification , 2011, IEEE Transactions on Audio, Speech, and Language Processing.

[7]  Themos Stafylakis,et al.  A Study of the Cosine Distance-Based Mean Shift for Telephone Speech Diarization , 2014, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[8]  Philippe Joly,et al.  Audiovisual diarization of people in video content , 2012, Multimedia Tools and Applications.

[9]  Bin Ma,et al.  PLDA Modeling in I-Vector and Supervector Space for Speaker Verification , 2012, INTERSPEECH.

[10]  Patrick Kenny,et al.  Bayesian Speaker Verification with Heavy-Tailed Priors , 2010, Odyssey.

[11]  Daniel Garcia-Romero,et al.  Analysis of i-vector Length Normalization in Speaker Recognition Systems , 2011, INTERSPEECH.

[12]  Harold W. Kuhn,et al.  The Hungarian method for the assignment problem , 1955, 50 Years of Integer Programming.

[13]  James R. Glass,et al.  Exploiting Intra-Conversation Variability for Speaker Diarization , 2011, INTERSPEECH.

[14]  Andrew H. Sung,et al.  A Similarity Measure for Clustering and its Applications , 2008 .

[15]  Mickael Rouvier,et al.  A global optimization framework for speaker diarization , 2012, Odyssey.

[16]  Marc Ferras,et al.  Speaker diarization and linking of large corpora , 2012, 2012 IEEE Spoken Language Technology Workshop (SLT).

[17]  Dorin Comaniciu,et al.  Mean Shift: A Robust Approach Toward Feature Space Analysis , 2002, IEEE Trans. Pattern Anal. Mach. Intell..

[18]  Jan Silovský,et al.  Speaker diarization using PLDA-based speaker clustering , 2011, Proceedings of the 6th IEEE International Conference on Intelligent Data Acquisition and Advanced Computing Systems.

[19]  James R. Glass,et al.  Unsupervised Methods for Speaker Diarization: An Integrated and Iterative Approach , 2013, IEEE Transactions on Audio, Speech, and Language Processing.

[20]  John H. L. Hansen,et al.  I4u submission to NIST SRE 2012: a large-scale collaborative effort for noise-robust speaker verification , 2013, INTERSPEECH.

[21]  S. Chen,et al.  Speaker, Environment and Channel Change Detection and Clustering via the Bayesian Information Criterion , 1998 .

[22]  Pascal Druyts,et al.  Applying Logistic Regression to the Fusion of the NIST'99 1-Speaker Submissions , 2000, Digit. Signal Process..

[23]  Themos Stafylakis,et al.  Efficient iterative mean shift based cosine dissimilarity for multi-recording speaker clustering , 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[24]  Elie el Khoury,et al.  Speaker Diarization: Towards a More Robust and Portable System , 2007, 2007 IEEE International Conference on Acoustics, Speech and Signal Processing - ICASSP '07.

[25]  Niko Brümmer,et al.  Towards Fully Bayesian Speaker Recognition: Integrating Out the Between-Speaker Covariance , 2011, INTERSPEECH.

[26]  J. H. Ward Hierarchical Grouping to Optimize an Objective Function , 1963 .

[27]  MarcelSebastien,et al.  A Scalable Formulation of Probabilistic Linear Discriminant Analysis , 2013 .

[28]  Douglas A. Reynolds,et al.  Speaker Verification Using Adapted Gaussian Mixture Models , 2000, Digit. Signal Process..

[29]  G. N. Lance,et al.  A General Theory of Classificatory Sorting Strategies: 1. Hierarchical Systems , 1967, Comput. J..

[30]  James H. Elder,et al.  Probabilistic Linear Discriminant Analysis for Inferences About Identity , 2007, 2007 IEEE 11th International Conference on Computer Vision.

[31]  Lukás Burget,et al.  Full-covariance UBM and heavy-tailed PLDA in i-vector speaker verification , 2011, 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).