Learning deep representations by multilayer bootstrap networks for speaker diarization

The performance of speaker diarization is strongly affected by the clustering algorithm used at the test stage. However, clustering algorithms are known to be sensitive to random noise and small variations, particularly when the algorithms themselves suffer from weaknesses such as bad local minima and restrictive prior assumptions. To deal with this problem, a compact representation of speech segments with small within-class variances and large between-class distances is usually needed. In this paper, we apply the multilayer bootstrap network (MBN), an unsupervised deep model for nonlinear dimensionality reduction, to further process the embedding vectors of speech segments. Unlike traditional neural-network-based deep models, MBN is a stack of $k$-centroids clustering ensembles, each of which is trained simply by random resampling of data and one-nearest-neighbor optimization. We construct speaker diarization systems by combining MBN with either an i-vector frontend or an x-vector frontend, and evaluate their effectiveness on a simulated NIST diarization dataset, the AMI meeting corpus, and the NIST SRE 2000 CALLHOME database. Experimental results show that the proposed systems perform better than, or at least comparably to, the systems that do not use MBN.
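
The layer structure described in the abstract (ensembles of $k$-centroids clusterings built by random resampling and one-nearest-neighbor assignment, stacked on top of each other) can be illustrated with a minimal NumPy sketch. Everything beyond that description is an assumption made for illustration: the function names, the one-hot encoding of each clustering's assignment, the values of `ks` and `n_clusterings`, and the random toy input standing in for i-vector or x-vector embeddings. It is not the authors' implementation and omits any final dimensionality-reduction step.

```python
import numpy as np

def fit_layer(X, k, n_clusterings, rng):
    """Random resampling: each clustering takes k random rows of X as centroids."""
    n = X.shape[0]
    return [X[rng.choice(n, size=k, replace=False)] for _ in range(n_clusterings)]

def encode_layer(X, centroid_sets):
    """One-nearest-neighbor assignment per clustering, one-hot encoded (assumed)."""
    codes = []
    for C in centroid_sets:
        # Squared Euclidean distance from every sample to every centroid: (n, k).
        d = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
        nearest = d.argmin(axis=1)
        onehot = np.zeros((X.shape[0], C.shape[0]))
        onehot[np.arange(X.shape[0]), nearest] = 1.0
        codes.append(onehot)
    # Concatenate the ensemble's one-hot codes into one sparse feature vector.
    return np.hstack(codes)

def mbn_sketch(X, ks=(8, 4, 2), n_clusterings=5, seed=0):
    """Stack layers; k shrinks from layer to layer (hypothetical schedule)."""
    rng = np.random.default_rng(seed)
    H = X
    for k in ks:
        centroid_sets = fit_layer(H, k, n_clusterings, rng)
        H = encode_layer(H, centroid_sets)
    return H

# Toy input: 20 hypothetical segment embeddings of dimension 16.
X = np.random.default_rng(1).normal(size=(20, 16))
H = mbn_sketch(X)
```

In this sketch the output of the last layer has `n_clusterings * ks[-1]` columns, and each row contains exactly one active entry per clustering; a clustering algorithm would then be run on a compact representation derived from such features.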
