But System for the Second Dihard Speech Diarization Challenge

This paper describes the winning systems developed by the BUT team for the four tracks of the Second DIHARD Speech Diarization Challenge. For tracks 1 and 2 the systems were mainly based on performing agglomerative hierarchical clustering (AHC) of x-vectors, followed by another x-vector clustering based on Bayes hidden Markov model and variational Bayes inference. We provide a comparison of the improvement given by each step and share the implementation of the core of the system. For tracks 3 and 4 with recordings from the Fifth CHiME Challenge, we explored different approaches for doing multi-channel diarization and our best performance was obtained when applying AHC on the fusion of per channel probabilistic linear discriminant analysis scores.

[1]  Mireia Díez,et al.  Speaker Diarization based on Bayesian HMM with Eigenvoice Priors , 2018, Odyssey.

[2]  Patrick Kenny,et al.  Bayesian Speaker Verification with Heavy-Tailed Priors , 2010, Odyssey.

[3]  Joon Son Chung,et al.  VoxCeleb2: Deep Speaker Recognition , 2018, INTERSPEECH.

[4]  Daniel Garcia-Romero,et al.  Analysis of i-vector Length Normalization in Speaker Recognition Systems , 2011, INTERSPEECH.

[5]  P. Somervuo,et al.  Bayesian Analysis of Speaker Diarization with Eigenvoice Priors , 2008 .

[6]  Jan Cernocký,et al.  Bayesian HMM Based x-Vector Clustering for Speaker Diarization , 2019, INTERSPEECH.

[7]  Jon Barker,et al.  The fifth 'CHiME' Speech Separation and Recognition Challenge: Dataset, task and baselines , 2018, INTERSPEECH.

[8]  Shuai Wang,et al.  BUT System Description for DIHARD Speech Diarization Challenge 2019 , 2019 .

[9]  Hagai Aronowitz,et al.  Audio enhancing with DNN autoencoder for speaker recognition , 2016, 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[10]  Reinhold Haeb-Umbach,et al.  NARA-WPE: A Python package for weighted prediction error dereverberation in Numpy and Tensorflow for online and offline processing , 2018, ITG Symposium on Speech Communication.

[11]  Shinji Watanabe,et al.  Diarization is Hard: Some Experiences and Lessons Learned for the JHU Team in the Inaugural DIHARD Challenge , 2018, INTERSPEECH.

[12]  Biing-Hwang Juang,et al.  Speech Dereverberation Based on Variance-Normalized Delayed Linear Prediction , 2010, IEEE Transactions on Audio, Speech, and Language Processing.

[13]  Kenneth Ward Church,et al.  The Second DIHARD Diarization Challenge: Dataset, task, and baselines , 2019, INTERSPEECH.

[14]  Daniel Povey,et al.  The Kaldi Speech Recognition Toolkit , 2011 .

[15]  Mireia Díez,et al.  BUT System for DIHARD Speech Diarization Challenge 2018 , 2018, INTERSPEECH.

[16]  Tillman Weyde,et al.  Improved Speech Enhancement with the Wave-U-Net , 2018, ArXiv.

[17]  Federico Landini,et al.  Analysis of Speaker Diarization Based on Bayesian HMM With Eigenvoice Priors , 2020, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[18]  Jon Barker,et al.  The second ‘chime’ speech separation and recognition challenge: Datasets, tasks and baselines , 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[19]  Christopher M. Bishop,et al.  Pattern Recognition and Machine Learning (Information Science and Statistics) , 2006 .

[20]  Mireia Díez,et al.  Optimizing Bayesian Hmm Based X-Vector Clustering for the Second Dihard Speech Diarization Challenge , 2020, ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).