Investigating Deep Neural Networks for Speaker Diarization in the DIHARD Challenge

We investigate the use of deep neural networks (DNNs) for the speaker diarization task to improve performance under domain mismatched conditions. Three unsupervised domain adaptation techniques, namely inter-dataset variability compensation (IDVC), domain-invariant covariance normalization (DICN), and domain mismatch modeling (DMM), are applied on DNN based speaker embeddings to compensate for the mismatch in the embedding subspace. We present results conducted on the DIHARD data, which was released for the 2018 diarization challenge. Collected from a diverse set of domains, this data provides very challenging domain mismatched conditions for the diarization task. Our results provide insights into how the performance of our proposed system could be further improved.

[1]  Oliver Durr,et al.  Speaker identification and clustering using convolutional neural networks , 2016, 2016 IEEE 26th International Workshop on Machine Learning for Signal Processing (MLSP).

[2]  Georg Heigold,et al.  End-to-end text-dependent speaker verification , 2015, 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[3]  Sergey Ioffe,et al.  Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift , 2015, ICML.

[4]  Sridha Sridharan,et al.  Improving out-domain PLDA speaker verification using unsupervised inter-dataset variability compensation approach , 2015, 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[5]  Patrick Kenny,et al.  Joint Factor Analysis of Speaker and Session Variability: Theory and Algorithms , 2006 .

[6]  Sridha Sridharan,et al.  The QUT-NOISE-TIMIT corpus for the evaluation of voice activity detection algorithms , 2010, INTERSPEECH.

[7]  Tara N. Sainath,et al.  Deep Neural Networks for Acoustic Modeling in Speech Recognition: The Shared Views of Four Research Groups , 2012, IEEE Signal Processing Magazine.

[8]  Thilo Stadelmann,et al.  Speaker identification and clustering using convolutional neural networks , 2016 .

[9]  Florin Curelaru,et al.  Front-End Factor Analysis For Speaker Verification , 2018, 2018 International Conference on Communications (COMM).

[10]  Douglas A. Reynolds,et al.  Diarization of Telephone Conversations Using Factor Analysis , 2010, IEEE Journal of Selected Topics in Signal Processing.

[11]  Marijn Huijbregts,et al.  The ICSI RT07s Speaker Diarization System , 2007, CLEAR.

[12]  Xiao Liu,et al.  Deep Speaker: an End-to-End Neural Speaker Embedding System , 2017, ArXiv.

[13]  Alan McCree,et al.  Speaker diarization using deep neural network embeddings , 2017, 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[14]  Sanjeev Khudanpur,et al.  Deep Neural Network Embeddings for Text-Independent Speaker Verification , 2017, INTERSPEECH.

[15]  Jean-Luc Gauvain,et al.  Multistage speaker diarization of broadcast news , 2006, IEEE Transactions on Audio, Speech, and Language Processing.

[16]  Sridha Sridharan,et al.  Dataset-invariant covariance normalization for out-domain PLDA speaker verification , 2015, INTERSPEECH.

[17]  Alan McCree,et al.  Speaker diarization with i-vectors from DNN senone posteriors , 2015, INTERSPEECH.

[18]  Nicholas W. D. Evans,et al.  Speaker Diarization: A Review of Recent Research , 2010, IEEE Transactions on Audio, Speech, and Language Processing.

[19]  Niko Brümmer,et al.  Unsupervised Domain Adaptation for I-Vector Speaker Recognition , 2014, Odyssey.

[20]  Yun Lei,et al.  A novel scheme for speaker recognition using a phonetically-aware deep neural network , 2014, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[21]  Sridha Sridharan,et al.  Domain Mismatch Modeling of Out-Domain i-Vectors for PLDA Speaker Verification , 2017, INTERSPEECH.

[22]  Erik McDermott,et al.  Deep neural networks for small footprint text-dependent speaker verification , 2014, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[23]  Itshak Lapidot,et al.  On the Use of PLDA i-vector Scoring for Clustering Short Segments , 2016, Odyssey.

[24]  M. A. Siegler,et al.  Automatic Segmentation, Classification and Clustering of Broadcast News Audio , 1997 .

[25]  Oliver Durr,et al.  Learning embeddings for speaker clustering based on voice equality , 2017, 2017 IEEE 27th International Workshop on Machine Learning for Signal Processing (MLSP).

[26]  Mickael Rouvier,et al.  Speaker diarization through speaker embeddings , 2015, 2015 23rd European Signal Processing Conference (EUSIPCO).

[27]  Daniel Povey,et al.  The Kaldi Speech Recognition Toolkit , 2011 .

[28]  Alan McCree,et al.  Priors for Speaker Counting and Diarization with AHC , 2016, INTERSPEECH.

[29]  James Philbin,et al.  FaceNet: A unified embedding for face recognition and clustering , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[30]  Tomasz Trzcinski,et al.  Speaker Diarization Using Deep Recurrent Convolutional Neural Networks for Speaker Embeddings , 2017, ISAT.

[31]  Daniel Garcia-Romero,et al.  Speaker diarization with plda i-vector scoring and unsupervised calibration , 2014, 2014 IEEE Spoken Language Technology Workshop (SLT).

[32]  Sanjeev Khudanpur,et al.  X-Vectors: Robust DNN Embeddings for Speaker Recognition , 2018, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).