Dilated residual networks with multi-level attention for speaker verification

Abstract With the development of deep learning techniques, speaker verification (SV) systems based on deep neural networks (DNNs) achieve performance competitive with traditional i-vector-based approaches. Previous DNN-based SV methods usually employ time-delay neural networks, which limits how far the network can be extended toward an effective representation. Moreover, the attention mechanisms used in existing DNN-based SV systems are applied at only a single level of the network architecture, leading to insufficient extraction of important features. To address these issues, we propose an effective deep speaker embedding architecture for SV that combines residually connected one-dimensional dilated convolutional layers, called dilated residual networks (DRNs), with a multi-level attention model. The DRNs can not only capture long-range time-frequency context in the features but also exploit information from multiple layers of the DNN. In addition, the multi-level attention model, which consists of two-dimensional convolutional block attention modules employed at the frame level and vector-based attention applied at the pooling layer, emphasizes important features at multiple levels of the DNN. Experiments on the NIST SRE 2016 dataset show that the proposed architecture achieves a superior equal error rate (EER) of 7.094% and a better detection cost function (DCF16) of 0.552 compared with state-of-the-art methods. Furthermore, ablation experiments demonstrate the effectiveness of the dilated convolutions and the multi-level attention on SV tasks.
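To make the architectural ingredients concrete, the following is a minimal NumPy sketch of the two mechanisms named in the abstract: a one-dimensional dilated convolution wrapped in a residual block, and vector-based attentive statistics pooling over frames. All function names, shapes, and the exact block layout here are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def dilated_conv1d(x, w, dilation=1):
    """1-D convolution with dilation and 'same' padding.
    x: (c_in, T) feature map; w: (c_out, c_in, k) kernel, k odd."""
    c_out, c_in, k = w.shape
    pad = dilation * (k - 1) // 2
    xp = np.pad(x, ((0, 0), (pad, pad)))
    T = x.shape[1]
    out = np.zeros((c_out, T))
    for t in range(T):
        # Dilated receptive field: taps spaced `dilation` frames apart.
        taps = xp[:, t : t + dilation * k : dilation]        # (c_in, k)
        out[:, t] = np.tensordot(w, taps, axes=([1, 2], [0, 1]))
    return out

def dilated_residual_block(x, w1, w2, dilation):
    """Two dilated convolutions with a skip connection (hypothetical layout)."""
    h = np.maximum(dilated_conv1d(x, w1, dilation), 0.0)     # ReLU
    h = dilated_conv1d(h, w2, dilation)
    return np.maximum(x + h, 0.0)                            # residual add, then ReLU

def attentive_stats_pooling(h, v):
    """Vector-based attention over frames, then weighted mean/std pooling.
    h: (d, T) frame-level features; v: (d,) learnable attention vector."""
    scores = v @ h
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()                                     # softmax over frames
    mu = h @ alpha                                           # weighted mean
    var = (h ** 2) @ alpha - mu ** 2                         # weighted variance
    sigma = np.sqrt(np.maximum(var, 1e-8))
    return np.concatenate([mu, sigma])                       # (2d,) utterance embedding
```

Stacking such blocks with increasing dilations (e.g. 1, 2, 4) grows the temporal receptive field exponentially with depth, which is how dilated convolutions capture long-range context without pooling away frame resolution; the pooled (2d,) vector is what a subsequent embedding layer would consume.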
