Speaker Embeddings Incorporating Acoustic Conditions for Diarization

We present our work on training speaker embeddings that are especially effective for speaker diarization. Extracting speaker embeddings with Deep Neural Networks (DNNs) has become the dominant approach for a wide range of speaker recognition tasks. Such embeddings are generally trained to discriminate between speakers while remaining robust to differences in acoustic conditions. In speaker diarization, however, the acoustic conditions themselves can serve as consistent cues for discriminating speakers: examples include each speaker's distance to the microphone in a meeting, or the per-speaker channel characteristics in a telephone conversation recorded in mono. The proposed speaker-embedding network therefore leverages differences in acoustic conditions to train embeddings that are effective for speaker diarization. The acoustic-condition information can be anything that helps distinguish recording environments; in this work, we explore using i-vectors. Experiments on a practical diarization system demonstrate that the proposed embeddings significantly improve performance over embeddings trained without acoustic-condition information.
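
To make the idea concrete, below is a minimal PyTorch sketch of one plausible realization, in which an utterance-level i-vector summarizing the recording condition is concatenated with the frame-level acoustic features before the embedding encoder. All architectural choices here (LSTM encoder, average pooling, softmax speaker-classification loss) and all dimensions are illustrative assumptions, not the paper's exact configuration.

```python
# Hypothetical sketch: a speaker-embedding network conditioned on an
# utterance-level i-vector that summarizes the acoustic condition.
# Layer types, sizes, pooling, and loss are illustrative assumptions.
import torch
import torch.nn as nn

class ConditionedSpeakerEmbedder(nn.Module):
    def __init__(self, feat_dim=40, ivec_dim=100, hidden=256,
                 emb_dim=128, num_speakers=1000):
        super().__init__()
        # The frame encoder sees acoustic features concatenated with the
        # i-vector describing the recording condition (assumed input).
        self.encoder = nn.LSTM(feat_dim + ivec_dim, hidden, batch_first=True)
        self.embed = nn.Linear(hidden, emb_dim)
        # Speaker-classification head, used only during training.
        self.classifier = nn.Linear(emb_dim, num_speakers)

    def forward(self, feats, ivec):
        # feats: (batch, time, feat_dim); ivec: (batch, ivec_dim)
        cond = ivec.unsqueeze(1).expand(-1, feats.size(1), -1)
        x = torch.cat([feats, cond], dim=-1)
        h, _ = self.encoder(x)
        emb = self.embed(h.mean(dim=1))  # temporal average pooling
        return emb, self.classifier(emb)

# Toy usage: one training step on random data.
model = ConditionedSpeakerEmbedder()
feats = torch.randn(8, 200, 40)     # 8 segments, 200 frames of 40-d features
ivecs = torch.randn(8, 100)         # per-recording i-vectors (assumed given)
labels = torch.randint(0, 1000, (8,))
emb, logits = model(feats, ivecs)
loss = nn.functional.cross_entropy(logits, labels)
loss.backward()
```

At inference, only the embedding output would be kept and clustered by the diarization back-end; the classification head is discarded after training.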
