Meta-Learning With Latent Space Clustering in Generative Adversarial Network for Speaker Diarization

Most speaker diarization systems based on x-vector embeddings are vulnerable to noisy environments and lack domain robustness. Earlier work on speaker diarization using a generative adversarial network (GAN) with an encoder network (ClusterGAN) to project input x-vectors into a latent space has shown promising performance on meeting data. In this paper, we extend the ClusterGAN network to improve diarization robustness and enable rapid generalization across various challenging domains. To this end, we take the pre-trained encoder from the ClusterGAN and fine-tune it with a prototypical loss (meta-ClusterGAN, or MCGAN) under the meta-learning paradigm. Experiments are conducted on CALLHOME telephonic conversations, AMI meeting data, the DIHARD II development set, which is a challenging multi-domain corpus, and two child-clinician interaction corpora (ADOS, BOSCC) from the autism spectrum disorder domain. We analyze the experimental data extensively to investigate the effectiveness of the proposed ClusterGAN and MCGAN embeddings over x-vectors. The results show that the proposed embeddings with a normalized maximum eigengap spectral clustering (NME-SC) back-end consistently outperform the Kaldi state-of-the-art x-vector diarization system. Finally, we employ embedding fusion with x-vectors to further improve diarization performance. We achieve a relative diarization error rate (DER) improvement of 6.67% to 53.93% on the aforementioned datasets using the proposed fused embeddings over x-vectors. In addition, the MCGAN embeddings estimate the number of speakers more accurately and diarize short speech segments better than x-vectors and ClusterGAN on telephonic data.
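
To make the fine-tuning objective concrete, the following is a minimal sketch of a prototypical loss in its commonly used form: each speaker's prototype is the mean of that speaker's support embeddings, and each query embedding is scored by a softmax over negative squared Euclidean distances to the prototypes. The abstract does not give the distance metric, the episode construction, or any hyperparameters, so those choices are assumptions here. The function name `prototypical_loss` and its arguments are hypothetical, and NumPy arrays stand in for encoder outputs.

```python
import numpy as np


def prototypical_loss(support, support_labels, query, query_labels):
    """Mean negative log-likelihood of query embeddings under class prototypes.

    Assumes squared Euclidean distance; every query label must also
    appear among the support labels.
    """
    classes = np.unique(support_labels)  # sorted unique speaker labels
    # One prototype per speaker: the mean of that speaker's support embeddings.
    prototypes = np.stack(
        [support[support_labels == c].mean(axis=0) for c in classes]
    )
    # Squared Euclidean distance from every query to every prototype.
    diffs = query[:, None, :] - prototypes[None, :, :]
    dists = (diffs ** 2).sum(axis=-1)
    # Numerically stable log-softmax over negative distances.
    logits = -dists
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Position of each query's true speaker within `classes`.
    targets = np.searchsorted(classes, query_labels)
    return -log_probs[np.arange(len(query)), targets].mean()


# Toy episode (illustrative values only): two speakers, 2-D embeddings.
support = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 10.0], [10.0, 12.0]])
support_labels = np.array([0, 0, 1, 1])
query = np.array([[0.0, 1.0], [10.0, 11.0]])
query_labels = np.array([0, 1])
episode_loss = prototypical_loss(support, support_labels, query, query_labels)
```

In this toy episode each query lies exactly on its own speaker's prototype, so the loss is close to zero. In an actual fine-tuning loop the same quantity would be computed on encoder outputs with an automatic differentiation framework so that gradients reach the encoder weights.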
