Investigating Domain Sensitivity of DNN Embeddings for Speaker Recognition Systems

Speaker embedding frameworks achieve state-of-the-art speaker recognition performance by modeling speaker-discriminant information directly with deep neural networks (DNNs). Since the introduction of neural-network-based speaker embeddings, researchers have explored the requirements for training an effective embedding network. For optimal performance, however, the domain of the data used for system development should match the domain of operation. In this paper, we investigate how sensitive the embedding space is to domain mismatch. Specifically, performance degrades when back-end scoring of the embeddings is performed with out-domain data. To compensate for the domain mismatch, we propose two novel deep domain adaptation techniques based on autoencoder architectures trained on embeddings in an unsupervised fashion. The results show that autoencoders can effectively compensate for domain mismatch by adapting the out-domain data to the in-domain.
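The general idea of adapting embeddings with an autoencoder trained without speaker labels can be illustrated with a minimal sketch. This is not the architecture proposed in the paper, whose details are not given in the abstract. It is one possible reading: a small autoencoder is fit to unlabeled in-domain embeddings, and out-domain embeddings are then passed through it. The embedding dimension, hidden size, learning rate, synthetic data, and the function names `train_autoencoder` and `adapt` are all assumptions made for illustration.

```python
import numpy as np


def train_autoencoder(x, hidden_dim=8, lr=0.05, epochs=500, seed=0):
    """Fit a one-hidden-layer tanh autoencoder to the rows of x.

    Uses plain full-batch gradient descent on the mean squared
    reconstruction error. No speaker labels are needed.
    """
    rng = np.random.default_rng(seed)
    n, d = x.shape
    w1 = rng.normal(0.0, 0.1, (d, hidden_dim))
    b1 = np.zeros(hidden_dim)
    w2 = rng.normal(0.0, 0.1, (hidden_dim, d))
    b2 = np.zeros(d)
    losses = []
    for _ in range(epochs):
        h = np.tanh(x @ w1 + b1)           # encoder
        err = h @ w2 + b2 - x              # reconstruction error
        losses.append(float(np.mean(err ** 2)))
        g_out = 2.0 * err / (n * d)        # d(loss)/d(reconstruction)
        g_h = (g_out @ w2.T) * (1.0 - h ** 2)
        w2 -= lr * (h.T @ g_out)
        b2 -= lr * g_out.sum(axis=0)
        w1 -= lr * (x.T @ g_h)
        b1 -= lr * g_h.sum(axis=0)
    return (w1, b1, w2, b2), losses


def adapt(params, x):
    """Pass embeddings through the trained autoencoder."""
    w1, b1, w2, b2 = params
    return np.tanh(x @ w1 + b1) @ w2 + b2


# Synthetic stand-ins for real embeddings (hypothetical data):
# 16-dimensional vectors lying near a 4-dimensional subspace.
rng = np.random.default_rng(1)
mixing = 0.5 * rng.normal(size=(4, 16))
in_domain = rng.normal(size=(200, 4)) @ mixing
out_domain = rng.normal(size=(50, 4)) @ mixing + 0.5  # shifted domain

params, losses = train_autoencoder(in_domain)
adapted = adapt(params, out_domain)
print("first / last training loss:", losses[0], losses[-1])
print("adapted embeddings shape:", adapted.shape)
```

In a real system the adapted embeddings would then go to the back-end scorer. Whether adaptation should map out-domain data toward the in-domain, as here, or be trained on both domains jointly is a design choice this sketch does not settle.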
