Variational Domain Adversarial Learning for Speaker Verification

Domain mismatch refers to the problem in which the distribution of training data differs from that of the test data. This paper proposes a variational domain adversarial neural network (VDANN), which consists of a variational autoencoder (VAE) and a domain adversarial neural network (DANN), to reduce domain mismatch. The DANN part aims to retain speaker identity information and learn a feature space that is robust against domain mismatch, while the VAE part is to impose variational regularization on the learned features so that they follow a Gaussian distribution. Thus, the representation produced by VDANN is not only speaker discriminative and domaininvariant but also Gaussian distributed, which is essential for the standard PLDA backend. Experiments on both SRE16 and SRE18-CMN2 show that VDANN outperforms the Kaldi baseline and the standard DANN. The results also suggest that VAE regularization is effective for domain adaptation.

[1]  Yoshua Bengio,et al.  Generative Adversarial Nets , 2014, NIPS.

[2]  Sanjeev Khudanpur,et al.  X-Vectors: Robust DNN Embeddings for Speaker Recognition , 2018, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[3]  Eduardo Lleida,et al.  Bayesian adaptation of PLDA based speaker recognition to domains with scarce development data , 2012, Odyssey.

[4]  Man-Wai Mak,et al.  Unsupervised Domain Adaptation for Gender-Aware PLDA Mixture Models , 2018, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[5]  Sridha Sridharan,et al.  Dataset-invariant covariance normalization for out-domain PLDA speaker verification , 2015, INTERSPEECH.

[6]  Douglas A. Reynolds,et al.  Unsupervised Clustering Approaches for Domain Adaptation in Speaker Recognition Systems , 2014, Odyssey.

[7]  Jen-Tzung Chien,et al.  Semi-supervised Nuisance-attribute Networks for Domain Adaptation , 2019, ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[8]  Dong Wang,et al.  VAE-based regularization for deep speaker embedding , 2019, INTERSPEECH.

[9]  Jen-Tzung Chien,et al.  Multisource I-Vectors Domain Adaptation Using Maximum Mean Discrepancy Based Autoencoders , 2018, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[10]  Stephen Cox,et al.  Some statistical issues in the comparison of speech recognition algorithms , 1989, International Conference on Acoustics, Speech, and Signal Processing,.

[11]  Alan McCree,et al.  Improving speaker recognition performance in the domain adaptation challenge using deep neural networks , 2014, 2014 IEEE Spoken Language Technology Workshop (SLT).

[12]  Hagai Aronowitz,et al.  Inter dataset variability compensation for speaker recognition , 2014, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[13]  Jen-Tzung Chien,et al.  Adversarial domain separation and adaptation , 2017, 2017 IEEE 27th International Workshop on Machine Learning for Signal Processing (MLSP).

[14]  Hanseok Ko,et al.  Autoencoder Based Domain Adaptation for Speaker Recognition Under Insufficient Channel Information , 2017, INTERSPEECH.

[15]  Florin Curelaru,et al.  Front-End Factor Analysis For Speaker Verification , 2018, 2018 International Conference on Communications (COMM).

[16]  Lukás Burget,et al.  Speaker Verification Using End-to-end Adversarial Language Adaptation , 2018, ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[17]  Navdeep Jaitly,et al.  Adversarial Autoencoders , 2015, ArXiv.

[18]  Max Welling,et al.  Auto-Encoding Variational Bayes , 2013, ICLR.

[19]  Dong Wang,et al.  Gaussian-constrained Training for Speaker Verification , 2019, ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[20]  Zhanyu Ma,et al.  Adversarial Network Bottleneck Features for Noise Robust Speaker Verification , 2017, INTERSPEECH.

[21]  Sridha Sridharan,et al.  Improving out-domain PLDA speaker verification using unsupervised inter-dataset variability compensation approach , 2015, 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[22]  S. Shapiro,et al.  An Analysis of Variance Test for Normality (Complete Samples) , 1965 .

[23]  Alan McCree,et al.  Supervised domain adaptation for I-vector based speaker recognition , 2014, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[24]  François Laviolette,et al.  Domain-Adversarial Training of Neural Networks , 2015, J. Mach. Learn. Res..

[25]  Haizhou Li,et al.  Unsupervised Domain Adaptation via Domain Adversarial Training for Speaker Recognition , 2018, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[26]  Patrick Kenny,et al.  Bayesian Speaker Verification with Heavy-Tailed Priors , 2010, Odyssey.

[27]  William M. Campbell,et al.  Channel compensation for SVM speaker recognition , 2004, Odyssey.

[28]  Daniel Garcia-Romero,et al.  Analysis of i-vector Length Normalization in Speaker Recognition Systems , 2011, INTERSPEECH.

[29]  Lukás Burget,et al.  Fast variational Bayes for heavy-tailed PLDA applied to i-vectors and x-vectors , 2018, INTERSPEECH.