Mixup Learning Strategies for Text-Independent Speaker Verification

Mixup is a learning strategy that constructs additional virtual training samples by linearly interpolating random pairs of existing ones. It has been shown that mixup discourages memorization of the training data and thus improves model generalization. This paper investigates the mixup learning strategy for training a speaker-discriminative deep neural network (DNN) for better text-independent speaker verification. In recent speaker verification systems, a DNN is typically trained to classify the speakers in the training set; at the same time, it learns a low-dimensional speaker embedding, so that embeddings can be extracted for unseen speakers at evaluation time. We adapted the mixup strategy to the speaker-discriminative DNN training procedure and studied different mixup schemes, such as performing mixup on MFCC features or on raw audio samples. The mixup learning strategy was evaluated on the NIST SRE 2010, NIST SRE 2016, and SITW evaluation sets. Experimental results show consistent improvements of up to 13% relative in both EER and DCF. We further find that mixup training also consistently improves the DNN's speaker classification accuracy without requiring any additional data sources.
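The interpolation described above can be sketched as follows. This is a minimal, generic illustration of mixup applied to a batch of feature vectors (e.g. stacked MFCC frames) and one-hot speaker labels, not the paper's actual implementation; the function name, the Beta-distribution parameter `alpha=0.2`, and the batch-shuffling scheme are illustrative assumptions.

```python
import numpy as np

def mixup_batch(features, labels_onehot, alpha=0.2, rng=None):
    """Mix a batch with a shuffled copy of itself (mixup-style sketch).

    features:      array of shape (batch, feat_dim), e.g. MFCC vectors
    labels_onehot: array of shape (batch, num_speakers)
    alpha:         Beta(alpha, alpha) concentration (illustrative default)
    """
    rng = rng or np.random.default_rng()
    # Draw one interpolation weight for the whole batch.
    lam = rng.beta(alpha, alpha)
    # Pair each sample with a random other sample via a permutation.
    perm = rng.permutation(len(features))
    # Convex combination of both inputs and targets.
    mixed_x = lam * features + (1.0 - lam) * features[perm]
    mixed_y = lam * labels_onehot + (1.0 - lam) * labels_onehot[perm]
    return mixed_x, mixed_y
```

Because the labels are mixed with the same weight as the inputs, the network is trained against soft targets that remain a valid probability distribution over speakers.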