Speaker Diarization with Region Proposal Network

Speaker diarization is an important pre-processing step for many speech applications, and it aims to solve the "who spoke when" problem. Although the standard diarization systems can achieve satisfactory results in various scenarios, they are composed of several independently-optimized modules and cannot deal with the overlapped speech. In this paper, we propose a novel speaker diarization method: Region Proposal Network based Speaker Diarization (RPNSD). In this method, a neural network generates overlapped speech segment proposals, and compute their speaker embeddings at the same time. Compared with standard diarization systems, RPNSD has a shorter pipeline and can handle the overlapped speech. Experimental results on three diarization datasets reveal that RPNSD achieves remarkable improvements over the state-of-the-art x-vector baseline.

[1]  Kaiming He,et al.  Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[2]  Shinji Watanabe,et al.  Diarization is Hard: Some Experiences and Lessons Learned for the JHU Team in the Inaugural DIHARD Challenge , 2018, INTERSPEECH.

[3]  Patrick Kenny,et al.  Front-End Factor Analysis for Speaker Verification , 2011, IEEE Transactions on Audio, Speech, and Language Processing.

[4]  P. Somervuo,et al.  Bayesian Analysis of Speaker Diarization with Eigenvoice Priors , 2008 .

[5]  Joon Son Chung,et al.  VoxCeleb: A Large-Scale Speaker Identification Dataset , 2017, INTERSPEECH.

[6]  Daniel Povey,et al.  MUSAN: A Music, Speech, and Noise Corpus , 2015, ArXiv.

[7]  Mireia Díez,et al.  Speaker Diarization based on Bayesian HMM with Eigenvoice Priors , 2018, Odyssey.

[8]  Quan Wang,et al.  Fully Supervised Speaker Diarization , 2018, ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[9]  Marijn Huijbregts,et al.  The ICSI RT07s Speaker Diarization System , 2007, CLEAR.

[10]  Patrick Kenny,et al.  Joint Factor Analysis Versus Eigenchannels in Speaker Recognition , 2007, IEEE Transactions on Audio, Speech, and Language Processing.

[11]  Daniel Povey,et al.  The Kaldi Speech Recognition Toolkit , 2011 .

[12]  Jun Du,et al.  Speaker Diarization with Enhancing Speech for the First DIHARD Challenge , 2018, INTERSPEECH.

[13]  Erik McDermott,et al.  Deep neural networks for small footprint text-dependent speaker verification , 2014, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[14]  Ross B. Girshick,et al.  Mask R-CNN , 2017, 1703.06870.

[15]  Quan Wang,et al.  Speaker Diarization with LSTM , 2017, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[16]  Alan McCree,et al.  Speaker diarization using deep neural network embeddings , 2017, 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[17]  Sanjeev Khudanpur,et al.  Deep Neural Network Embeddings for Text-Independent Speaker Verification , 2017, INTERSPEECH.

[18]  Daniel Garcia-Romero,et al.  Speaker diarization with plda i-vector scoring and unsupervised calibration , 2014, 2014 IEEE Spoken Language Technology Workshop (SLT).

[19]  Douglas A. Reynolds,et al.  An overview of automatic speaker diarization systems , 2006, IEEE Transactions on Audio, Speech, and Language Processing.

[20]  Ross B. Girshick,et al.  Fast R-CNN , 2015, 1504.08083.

[21]  Douglas A. Reynolds,et al.  Approaches and applications of audio diarization , 2005, Proceedings. (ICASSP '05). IEEE International Conference on Acoustics, Speech, and Signal Processing, 2005..

[22]  James H. Elder,et al.  Probabilistic Linear Discriminant Analysis for Inferences About Identity , 2007, 2007 IEEE 11th International Conference on Computer Vision.

[23]  Sanjeev Khudanpur,et al.  X-Vectors: Robust DNN Embeddings for Speaker Recognition , 2018, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[24]  Eduardo Lleida,et al.  Estimation of the Number of Speakers with Variational Bayesian PLDA in the DIHARD Diarization Challenge , 2018, INTERSPEECH.

[25]  Mireia Díez,et al.  BUT System for DIHARD Speech Diarization Challenge 2018 , 2018, INTERSPEECH.

[26]  Nicholas W. D. Evans,et al.  Speaker Diarization: A Review of Recent Research , 2010, IEEE Transactions on Audio, Speech, and Language Processing.

[27]  Chao Wang,et al.  R-CRNN: Region-based Convolutional Recurrent Neural Network for Audio Event Detection , 2018, INTERSPEECH.

[28]  Daniel Garcia-Romero,et al.  Diarization resegmentation in the factor analysis subspace , 2015, 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[29]  Quan Wang,et al.  Generalized End-to-End Loss for Speaker Verification , 2017, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[30]  Joon Son Chung,et al.  VoxCeleb2: Deep Speaker Recognition , 2018, INTERSPEECH.

[31]  Georg Heigold,et al.  End-to-end text-dependent speaker verification , 2015, 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[32]  Naoyuki Kanda,et al.  End-to-End Neural Speaker Diarization with Self-Attention , 2019, 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU).

[33]  Xiao Liu,et al.  Deep Speaker: an End-to-End Neural Speaker Embedding System , 2017, ArXiv.

[34]  Sanjeev Khudanpur,et al.  A study on data augmentation of reverberant speech for robust speech recognition , 2017, 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[35]  Naoyuki Kanda,et al.  End-to-End Neural Speaker Diarization with Permutation-Free Objectives , 2019, INTERSPEECH.