Online speaker diarization using adapted i-vector transforms

Many speaker diarization systems operate in an off-line mode. Such systems typically find homogeneous segments and then cluster these segments by speaker. Clustering algorithms such as bottom-up clustering, k-means, or spectral clustering generally require all segments to be available before clustering can begin. However, real-time applications such as multi-person voice-interactive systems need online speaker assignment in a strict left-to-right fashion. In this paper we propose a novel Maximum a Posteriori (MAP) adapted transform within an i-vector speaker diarization framework that operates in a strict left-to-right fashion. Previous work has shown that the principal components of variation of fixed-dimensional i-vectors, learned across segments, tend to provide a strong basis for separating speakers. However, determining this basis can be problematic when there are few segments or when operating online. The proposed method addresses this by blending the prior with the estimated subspace as more i-vectors are observed. Given oracle speech activity detection (SAD) segments, adaptation achieves a 3.2% speaker diarization error under a strict left-to-right constraint on the LDC Callhome English Corpus, compared to 4.8% without adaptation.
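The idea of blending a prior with a subspace estimated from the i-vectors seen so far can be illustrated with a minimal sketch. It is not the paper's exact formulation: it assumes the prior takes the form of a covariance matrix, that blending is a relevance-factor interpolation in the style of MAP adaptation, and that the basis is the leading eigenvectors of the blended covariance. The function name `map_adapted_basis` and the parameters `prior_cov`, `tau`, and `n_components` are hypothetical.

```python
import numpy as np


def map_adapted_basis(ivectors, prior_cov, tau=10.0, n_components=3):
    """Blend a prior covariance with the covariance of observed i-vectors.

    Illustrative assumption, not the paper's exact method:
        alpha = n / (n + tau)
        cov   = alpha * sample_cov + (1 - alpha) * prior_cov
    With no observations the prior is used unchanged; as n grows, the
    estimate from the current recording dominates.
    """
    prior_cov = np.asarray(prior_cov, dtype=float)
    dim = prior_cov.shape[0]
    x = np.asarray(ivectors, dtype=float).reshape(-1, dim)
    n = x.shape[0]

    if n > 0:
        centered = x - x.mean(axis=0)
        sample_cov = centered.T @ centered / n
    else:
        sample_cov = np.zeros_like(prior_cov)

    alpha = n / (n + tau)
    cov = alpha * sample_cov + (1.0 - alpha) * prior_cov

    # eigh returns eigenvalues in ascending order; keep the largest ones.
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1][:n_components]
    return eigvecs[:, order], alpha


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    prior = np.diag([3.0, 2.0, 1.0])  # hypothetical prior covariance
    # Synthetic "i-vectors" whose variance lies mostly along the third axis.
    stream = rng.normal(size=(200, 3)) * np.array([0.1, 0.1, 10.0])
    for n_seen in (0, 2, 20, 200):
        basis, alpha = map_adapted_basis(stream[:n_seen], prior, n_components=1)
        print(n_seen, round(alpha, 3), np.round(np.abs(basis[:, 0]), 2))
```

In a left-to-right setting, a function like this would be called again each time a new segment's i-vector arrives, so the projection used for speaker assignment drifts from the prior toward the recording-specific subspace. The relevance factor `tau` controls how many observations are needed before the recording-specific estimate outweighs the prior.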
