Joint Speaker Diarization and Recognition Using Convolutional and Recurrent Neural Networks

Speaker diarization (detecting who spoke when, using relative identity labels) and speaker recognition (detecting absolute identity labels without timing) are distinct but related tasks that often need to be solved together. Traditional methods, however, address them independently. In this paper, we propose a method to jointly diarize and recognize speakers from a collection of conversations. The method exploits both the sparsity and temporal smoothness of speakers within a conversation and large-scale timbre modeling across recordings and speakers. Specifically, we employ one convolutional neural network (CNN) to perform segment-level speaker classification and another CNN to estimate the probability of a speaker change within a conversation. We then concatenate the outputs of both CNNs and feed them into a recurrent neural network (RNN) for joint speaker diarization and recognition. Experiments on different datasets show promising performance of our proposed approach.
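The fusion step described above can be illustrated with a minimal sketch. It assumes the two CNNs have already produced, for each segment, a speaker posterior vector and a scalar speaker-change probability; those values are made up here. The RNN is a plain tanh recurrent layer with random, untrained weights, chosen only for brevity, since the abstract does not specify the recurrent architecture. All function names, dimensions, and numbers are hypothetical and are not taken from the paper.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def fuse_features(speaker_posteriors, change_probs):
    """Concatenate each segment's speaker posteriors with its change probability."""
    if len(speaker_posteriors) != len(change_probs):
        raise ValueError("both CNN outputs must cover the same segments")
    return [list(p) + [c] for p, c in zip(speaker_posteriors, change_probs)]


def init_rnn(input_dim, hidden_dim, num_speakers, seed=0):
    """Random (untrained) weights for a simple tanh RNN; illustration only."""
    rng = random.Random(seed)

    def mat(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

    return {
        "W_in": mat(hidden_dim, input_dim),
        "W_rec": mat(hidden_dim, hidden_dim),
        "W_out": mat(num_speakers, hidden_dim),
    }


def rnn_forward(features, params):
    """Run the RNN over the segment sequence; return per-segment speaker posteriors."""
    hidden = [0.0] * len(params["W_rec"])
    outputs = []
    for x in features:
        pre = [a + b for a, b in zip(matvec(params["W_in"], x),
                                     matvec(params["W_rec"], hidden))]
        hidden = [math.tanh(p) for p in pre]
        outputs.append(softmax(matvec(params["W_out"], hidden)))
    return outputs


# Hypothetical CNN outputs for four consecutive segments and three known speakers.
speaker_posteriors = [[0.8, 0.1, 0.1], [0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.2, 0.7, 0.1]]
change_probs = [0.05, 0.10, 0.90, 0.05]

features = fuse_features(speaker_posteriors, change_probs)
params = init_rnn(input_dim=4, hidden_dim=8, num_speakers=3)
labels_over_time = rnn_forward(features, params)
```

Because the weights are untrained, the output distributions carry no meaning; the sketch only shows how the two per-segment signals are joined into one input sequence so that a recurrent model can use temporal context when assigning absolute speaker labels.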
