Cross Attentive Pooling for Speaker Verification

The goal of this paper is text-independent speaker verification where utterances come from ‘in the wild’ videos and may contain irrelevant signal. While speaker verification is naturally a pair-wise problem, existing methods to produce the speaker embeddings are instance-wise. In this paper, we propose Cross Attentive Pooling (CAP) that utilises the context information across the reference-query pair to generate utterance-level embeddings that contain the most discriminative information for the pair-wise matching problem. Experiments are performed on the VoxCeleb dataset in which our method outperforms comparable pooling strategies.

[1]  Quan Wang,et al.  Attention-Based Models for Text-Dependent Speaker Verification , 2017, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[2]  Stefanos Zafeiriou,et al.  ArcFace: Additive Angular Margin Loss for Deep Face Recognition , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[3]  Georg Heigold,et al.  End-to-end text-dependent speaker verification , 2015, 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[4]  Sanjeev Khudanpur,et al.  X-Vectors: Robust DNN Embeddings for Speaker Recognition , 2018, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[5]  Hoirin Kim,et al.  Meta-Learning for Short Utterance Speaker Recognition with Imbalance Length Pairs , 2020, INTERSPEECH.

[6]  Joon Son Chung,et al.  In defence of metric learning for speaker recognition , 2020, INTERSPEECH.

[7]  Yoohwan Kwon,et al.  Intra-class variation reduction of speaker representation in disentanglement framework , 2020, INTERSPEECH.

[8]  Hoirin Kim,et al.  Spatial Pyramid Encoding with Convex Length Normalization for Text-Independent Speaker Verification , 2019, INTERSPEECH.

[9]  Jian Cheng,et al.  Additive Margin Softmax for Face Verification , 2018, IEEE Signal Processing Letters.

[10]  Ming Li,et al.  Exploring the Encoding Layer and Loss Function in End-to-End Speaker and Language Recognition System , 2018, Odyssey.

[11]  Natalia Gimelshein,et al.  PyTorch: An Imperative Style, High-Performance Deep Learning Library , 2019, NeurIPS.

[12]  Shuai Wang,et al.  Margin Matters: Towards More Discriminative Deep Neural Network Embeddings for Speaker Recognition , 2019, 2019 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC).

[13]  Sanjeev Khudanpur,et al.  Deep Neural Network Embeddings for Text-Independent Speaker Verification , 2017, INTERSPEECH.

[14]  Frank Rudzicz,et al.  Centroid-based Deep Metric Learning for Speaker Recognition , 2019, ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[15]  Joon Son Chung,et al.  VoxSRC 2019: The first VoxCeleb Speaker Recognition Challenge , 2019, ArXiv.

[16]  Andrea Vedaldi,et al.  Instance Normalization: The Missing Ingredient for Fast Stylization , 2016, ArXiv.

[17]  Colleen Richey,et al.  Voices Obscured in Complex Environmental Settings (VOICES) corpus , 2018, INTERSPEECH.

[18]  Dengxin Dai,et al.  Unified Hypersphere Embedding for Speaker Recognition , 2018, ArXiv.

[19]  Xilin Chen,et al.  Cross Attention Network for Few-shot Classification , 2019, NeurIPS.

[20]  Joon Son Chung,et al.  VoxCeleb2: Deep Speaker Recognition , 2018, INTERSPEECH.

[21]  Brejesh Lall,et al.  Few Shot Speaker Recognition using Deep Neural Networks , 2019, ArXiv.

[22]  Andrew Zisserman,et al.  Return of the Devil in the Details: Delving Deep into Convolutional Nets , 2014, BMVC.

[23]  Joon Son Chung,et al.  Augmentation adversarial training for unsupervised speaker recognition , 2020, ArXiv.

[24]  Joon Son Chung,et al.  Voxceleb: Large-scale speaker verification in the wild , 2020, Comput. Speech Lang..

[25]  Richard S. Zemel,et al.  Prototypical Networks for Few-shot Learning , 2017, NIPS.

[26]  Xing Ji,et al.  CosFace: Large Margin Cosine Loss for Deep Face Recognition , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[27]  Wu-Jun Li,et al.  Ensemble Additive Margin Softmax for Speaker Verification , 2019, ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[28]  Jia Liu,et al.  Large Margin Softmax Loss for Speaker Verification , 2019, INTERSPEECH.

[29]  Sanjeev Khudanpur,et al.  x-Vector DNN Refinement with Full-Length Recordings for Speaker Recognition , 2019, INTERSPEECH.

[30]  Joon Son Chung,et al.  VoxCeleb: A Large-Scale Speaker Identification Dataset , 2017, INTERSPEECH.

[31]  Shuai Wang,et al.  BUT System Description to VoxCeleb Speaker Recognition Challenge 2019 , 2019, ArXiv.

[32]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[33]  Joon Son Chung,et al.  Utterance-level Aggregation for Speaker Recognition in the Wild , 2019, ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[34]  Patrick Kenny,et al.  Deep Speaker Embeddings for Short-Duration Speaker Verification , 2017, INTERSPEECH.