Paper information - Text-Independent Speaker Verification with Dual Attention Network

This paper presents a novel attention model for text-independent speaker verification. The model takes a pair of input utterances and generates an utterance-level embedding that represents the speaker-specific characteristics of each utterance. The input utterances are expected to have highly similar embeddings if they come from the same speaker. The proposed attention model consists of a self-attention module and a mutual-attention module, which jointly contribute to the generation of the utterance-level embedding. The self-attention weights are computed from the utterance itself, while the mutual-attention weights are computed with the involvement of the other utterance in the input pair. As a result, each utterance is represented by a self-attention weighted embedding and a mutual-attention weighted embedding. The similarity between the embeddings is measured by a cosine distance score and a binary classifier output score. The whole model, named Dual Attention Network, is trained end-to-end on the VoxCeleb database. The evaluation results on the VoxCeleb1 test set show that the Dual Attention Network significantly outperforms the baseline systems. The best result yields an equal error rate of 1.6%.
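The two pooling ideas described in the abstract can be sketched in a few lines of NumPy. This is an illustrative sketch only, not the authors' implementation. The scoring vector `w`, the use of the other utterance's mean frame as the mutual-attention query, and the concatenation of the two embeddings before scoring are all assumptions made for the example. The binary classifier score mentioned in the abstract is left out.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def self_attention_pool(frames, w):
    # frames: (T, D) frame-level features of one utterance.
    # w: (D,) hypothetical learned scoring vector (assumption).
    weights = softmax(frames @ w)   # weights depend only on this utterance
    return weights @ frames         # (D,) attention-weighted mean

def mutual_attention_pool(frames, other_frames):
    # Weights for `frames` are driven by a summary of the OTHER utterance.
    # Using its mean frame as the query is an assumption for illustration.
    query = other_frames.mean(axis=0)
    weights = softmax(frames @ query / np.sqrt(frames.shape[1]))
    return weights @ frames

def cosine_score(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def score_pair(x1, x2, w):
    # Each utterance gets a self-attention and a mutual-attention embedding;
    # concatenating them before scoring is an assumption of this sketch.
    e1 = np.concatenate([self_attention_pool(x1, w), mutual_attention_pool(x1, x2)])
    e2 = np.concatenate([self_attention_pool(x2, w), mutual_attention_pool(x2, x1)])
    return cosine_score(e1, e2)

# Random stand-ins for frame-level features of two utterances.
rng = np.random.default_rng(0)
x1 = rng.standard_normal((50, 8))   # 50 frames, 8-dim features
x2 = rng.standard_normal((60, 8))   # 60 frames, 8-dim features
w = rng.standard_normal(8)
print(score_pair(x1, x2, w))
```

In a verification setting, a score of this kind would be compared with a threshold to accept or reject the claim that both utterances come from the same speaker. The equal error rate is the operating point where the false acceptance and false rejection rates coincide.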

Tan Lee | Jingyu Li
