Text-Independent Speaker Verification with Dual Attention Network

This paper presents a novel attention model design for text-independent speaker verification. The model takes a pair of input utterances and generates an utterance-level embedding that represents the speaker-specific characteristics of each utterance. The two utterances are expected to have highly similar embeddings if they come from the same speaker. The proposed attention model consists of a self-attention module and a mutual-attention module, which jointly contribute to the generation of the utterance-level embedding. The self-attention weights are computed from the utterance itself, while the mutual-attention weights are computed with the involvement of the other utterance in the input pair. As a result, each utterance is represented by a self-attention weighted embedding and a mutual-attention weighted embedding. The similarity between the embeddings is measured by a cosine distance score and a binary classifier output score. The whole model, named the Dual Attention Network, is trained end-to-end on the VoxCeleb database. Evaluation results on the VoxCeleb1 test set show that the Dual Attention Network significantly outperforms the baseline systems. The best result yields an equal error rate of 1.6%.
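
The two kinds of attention weighting described in the abstract can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the abstract does not give the scoring functions, so the linear scoring vector `w`, the mean-pooled query used for mutual attention, the scaling by the square root of the feature dimension, and all function names here are assumptions chosen only to show the idea. The binary classifier score is omitted.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()


def self_attention_pool(frames, w):
    """Pool (T, D) frame features using weights computed from the utterance itself.

    `w` is a hypothetical learned scoring vector of shape (D,).
    """
    weights = softmax(frames @ w)  # one weight per frame, shape (T,)
    return weights @ frames        # weighted sum of frames, shape (D,)


def mutual_attention_pool(frames, other_frames):
    """Pool (T, D) frame features using weights that depend on the other utterance.

    Assumption: the other utterance is summarized by its mean frame, which
    acts as the query for scaled dot-product scoring.
    """
    query = other_frames.mean(axis=0)                       # shape (D,)
    weights = softmax(frames @ query / np.sqrt(frames.shape[1]))
    return weights @ frames                                 # shape (D,)


def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


# Random stand-ins for frame-level features of two utterances of different lengths.
rng = np.random.default_rng(0)
utt_a = rng.standard_normal((50, 16))
utt_b = rng.standard_normal((70, 16))
w = rng.standard_normal(16)

# Each utterance gets a self-attention embedding and a mutual-attention embedding.
self_score = cosine_similarity(
    self_attention_pool(utt_a, w), self_attention_pool(utt_b, w)
)
mutual_score = cosine_similarity(
    mutual_attention_pool(utt_a, utt_b), mutual_attention_pool(utt_b, utt_a)
)
```

Note that the self-attention embedding of an utterance is the same regardless of what it is compared against, whereas the mutual-attention embedding changes with the paired utterance. In a trained system the features would come from a frame-level encoder rather than random numbers, and the scoring parameters would be learned.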
