LSTM based Similarity Measurement with Spectral Clustering for Speaker Diarization

More and more neural network approaches have achieved considerable improvement upon submodules of speaker diarization system, including speaker change detection and segment-wise speaker embedding extraction. Still, in the clustering stage, traditional algorithms like probabilistic linear discriminant analysis (PLDA) are widely used for scoring the similarity between two speech segments. In this paper, we propose a supervised method to measure the similarity matrix between all segments of an audio recording with sequential bidirectional long short-term memory networks (Bi-LSTM). Spectral clustering is applied on top of the similarity matrix to further improve the performance. Experimental results show that our system significantly outperforms the state-of-the-art methods and achieves a diarization error rate of 6.63\% on the NIST SRE 2000 CALLHOME database.

[1]  Shinji Watanabe,et al.  Diarization is Hard: Some Experiences and Lessons Learned for the JHU Team in the Inaugural DIHARD Challenge , 2018, INTERSPEECH.

[2]  Daniel Garcia-Romero,et al.  Diarization resegmentation in the factor analysis subspace , 2015, 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[3]  Ming Li,et al.  Exploring the Encoding Layer and Loss Function in End-to-End Speaker and Language Recognition System , 2018, Odyssey.

[4]  Ming Li,et al.  Analysis of Length Normalization in End-to-End Speaker Verification System , 2018, INTERSPEECH.

[5]  Daniel Povey,et al.  The Kaldi Speech Recognition Toolkit , 2011 .

[6]  Marek Hrúz,et al.  Convolutional Neural Network for speaker change detection in telephone speaker diarization system , 2017, 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[7]  James H. Elder,et al.  Probabilistic Linear Discriminant Analysis for Inferences About Identity , 2007, 2007 IEEE 11th International Conference on Computer Vision.

[8]  Quan Wang,et al.  Speaker Diarization with LSTM , 2017, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[9]  Nicholas W. D. Evans,et al.  Speaker Diarization: A Review of Recent Research , 2010, IEEE Transactions on Audio, Speech, and Language Processing.

[10]  Alan McCree,et al.  Speaker diarization using deep neural network embeddings , 2017, 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[11]  Claude Barras,et al.  Speaker Change Detection in Broadcast TV Using Bidirectional Long Short-Term Memory Networks , 2017, INTERSPEECH.

[12]  G. Krishna,et al.  Agglomerative clustering using the concept of mutual nearest neighbourhood , 1978, Pattern Recognit..

[13]  Claude Barras,et al.  Neural Speech Turn Segmentation and Affinity Propagation for Speaker Diarization , 2018, INTERSPEECH.

[14]  Quan Wang,et al.  Fully Supervised Speaker Diarization , 2018, ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[15]  Daniel Garcia-Romero,et al.  Speaker diarization with plda i-vector scoring and unsupervised calibration , 2014, 2014 IEEE Spoken Language Technology Workshop (SLT).

[16]  Douglas A. Reynolds,et al.  An overview of automatic speaker diarization systems , 2006, IEEE Transactions on Audio, Speech, and Language Processing.

[17]  Sanjeev Khudanpur,et al.  X-Vectors: Robust DNN Embeddings for Speaker Recognition , 2018, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[18]  James R. Glass,et al.  Unsupervised Methods for Speaker Diarization: An Integrated and Iterative Approach , 2013, IEEE Transactions on Audio, Speech, and Language Processing.

[19]  Ulrike von Luxburg,et al.  A tutorial on spectral clustering , 2007, Stat. Comput..

[20]  Guillaume Wisniewski,et al.  Combining Speaker Turn Embedding and Incremental Structure Prediction for Low-Latency Speaker Diarization , 2017, INTERSPEECH.

[21]  Themos Stafylakis,et al.  PLDA for speaker verification with utterances of arbitrary duration , 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[22]  Sylvain Meignier,et al.  LIUM SPKDIARIZATION: AN OPEN SOURCE TOOLKIT FOR DIARIZATION , 2010 .

[23]  Anantha P. Chandrakasan,et al.  A Low-Power Speech Recognizer and Voice Activity Detector Using Deep Neural Networks , 2018, IEEE Journal of Solid-State Circuits.