A Robust Speaker Clustering Method Based on Discrete Tied Variational Autoencoder

Speaker clustering based on agglomerative hierarchical clustering (AHC) is a common approach to two main problems: clustering with no preset number of categories and clustering with a fixed number of categories. Typically, features such as i-vectors are fed to a probabilistic linear discriminant analysis (PLDA) model, which produces the distance matrix in long-utterance application scenarios, and the clustering model then yields the clustering results. However, traditional AHC-based speaker clustering has a long running time and remains sensitive to environmental noise. In this paper, we propose a novel speaker clustering method based on mutual information (MI) and a non-linear model with a discrete variable, inspired by the Tied Variational Autoencoder (TVAE), to improve robustness against noise. The proposed method, named Discrete Tied Variational Autoencoder (DTVAE), also shortens the elapsed time substantially. Experimental results show that it outperforms the general model, yielding a relative accuracy (ACC) improvement and a significant reduction in running time.
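The AHC baseline described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: cosine distance stands in for PLDA scoring, the linkage is average linkage, the two-dimensional vectors are toy data rather than i-vectors, and the function names `cosine_distance` and `ahc` are hypothetical. It covers the two settings mentioned in the abstract: a fixed number of categories (`n_clusters`) and no preset number of categories (a distance `threshold` as stopping rule). It does not implement DTVAE.

```python
import math


def cosine_distance(a, b):
    """Cosine distance between two vectors (assumed stand-in for a PLDA score)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def ahc(vectors, n_clusters=None, threshold=None):
    """Average-linkage AHC over a precomputed distance matrix.

    Pass n_clusters for the fixed-category setting, or threshold for the
    setting with no preset category number. Returns one label per vector.
    """
    if (n_clusters is None) == (threshold is None):
        raise ValueError("give exactly one of n_clusters or threshold")

    n = len(vectors)
    # Pairwise distance matrix between all utterance-level vectors.
    dist = [[cosine_distance(vectors[i], vectors[j]) for j in range(n)]
            for i in range(n)]
    # Start with every vector in its own cluster.
    clusters = [[i] for i in range(n)]

    def linkage(c1, c2):
        # Average of all pairwise distances between the two clusters.
        return sum(dist[i][j] for i in c1 for j in c2) / (len(c1) * len(c2))

    while len(clusters) > 1:
        if n_clusters is not None and len(clusters) <= n_clusters:
            break
        # Find the closest pair of clusters.
        d, a, b = min(
            (linkage(clusters[a], clusters[b]), a, b)
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
        )
        if threshold is not None and d > threshold:
            break
        # Merge cluster b into cluster a.
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]

    labels = [0] * n
    for k, members in enumerate(clusters):
        for i in members:
            labels[i] = k
    return labels


# Toy data: two well-separated directions, two vectors each.
toy_vectors = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]]
fixed_labels = ahc(toy_vectors, n_clusters=2)
open_labels = ahc(toy_vectors, threshold=0.5)
```

Each merge step scans all cluster pairs, so this naive version is roughly cubic in the number of vectors, which illustrates the long running time that the abstract attributes to AHC on large collections.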
