Paper information - Cross Attentive Pooling for Speaker Verification

Cross Attentive Pooling for Speaker Verification

The goal of this paper is text-independent speaker verification where utterances come from ‘in the wild’ videos and may contain irrelevant signal. While speaker verification is naturally a pair-wise problem, existing methods to produce the speaker embeddings are instance-wise. In this paper, we propose Cross Attentive Pooling (CAP) that utilises the context information across the reference-query pair to generate utterance-level embeddings that contain the most discriminative information for the pair-wise matching problem. Experiments are performed on the VoxCeleb dataset in which our method outperforms comparable pooling strategies.
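The pair-wise idea in the abstract can be illustrated with a small sketch. This is not the paper's exact formulation: the weighting rule (mean cosine similarity to the other utterance's frames, followed by a softmax), the `temperature` parameter, and every function name below are assumptions chosen for illustration. It only shows how frame-level features of a reference and a query could be pooled using weights derived from each other, instead of pooling each utterance on its own.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale each row of x to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()


def cross_attentive_pool(ref_frames, qry_frames, temperature=0.1):
    """Pool two utterances with attention weights derived from the pair.

    ref_frames: array of shape (T_r, D), frame-level features of the reference.
    qry_frames: array of shape (T_q, D), frame-level features of the query.
    Returns (ref_emb, qry_emb, ref_w, qry_w).
    """
    # Cosine similarity between every reference frame and every query frame.
    corr = l2_normalize(ref_frames) @ l2_normalize(qry_frames).T  # (T_r, T_q)

    # Assumption: a frame is relevant if, on average, it resembles
    # the frames of the other utterance.
    ref_w = softmax(corr.mean(axis=1) / temperature)  # (T_r,)
    qry_w = softmax(corr.mean(axis=0) / temperature)  # (T_q,)

    # Weighted sums over time give the utterance-level embeddings.
    ref_emb = ref_w @ ref_frames
    qry_emb = qry_w @ qry_frames
    return ref_emb, qry_emb, ref_w, qry_w


def cosine_score(a, b, eps=1e-8):
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + eps))


if __name__ == "__main__":
    # Toy pair: the second query frame is unrelated to the reference.
    ref = np.array([[1.0, 0.0], [1.0, 0.0]])
    qry = np.array([[1.0, 0.0], [0.0, 1.0]])
    ref_emb, qry_emb, ref_w, qry_w = cross_attentive_pool(ref, qry)
    print("query frame weights:", qry_w)
    print("pair-wise pooled score:", cosine_score(ref_emb, qry_emb))
    print("mean pooled score:", cosine_score(ref.mean(axis=0), qry.mean(axis=0)))
```

In the toy pair, the unrelated query frame receives a much smaller weight than the matching one, so the pooled query embedding stays close to the reference. Instance-wise mean pooling would average the unrelated frame in with equal weight.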

Joon Son Chung | Seong Min Kye | Yoohwan Kwon
