An empirical analysis of information encoded in disentangled neural speaker representations

The defining characteristic of robust speaker representations is invariance to factors of variability unrelated to speaker identity. Disentanglement is one technique for improving the robustness of speaker representations to both intrinsic factors introduced during speech production (e.g., emotion, lexical content) and extrinsic factors introduced during signal capture (e.g., channel, noise). Disentanglement in neural speaker representations can be achieved either in a supervised fashion, using annotations of the nuisance factors (factors unrelated to speaker identity), or in an unsupervised fashion, without labels for the factors to be removed. In either case, it is important to understand the extent to which the various factors of variability remain entangled in the representations. In this work, we examine speaker representations, with and without unsupervised disentanglement, for the amount of information they capture about a suite of such factors. Using classification experiments, we provide empirical evidence that disentanglement reduces the nuisance-factor information in speaker representations while retaining speaker information. This finding is further validated by speaker verification experiments on the VOiCES corpus under several challenging acoustic conditions. We also show that applying data augmentation during training of disentangled speaker embeddings improves robustness in speaker verification tasks. Finally, based on our findings, we provide insights into which factors can be effectively separated using the unsupervised disentanglement technique and discuss potential future directions.
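The classification-probe methodology described above can be summarized with a short sketch. The following is a minimal illustration, not the paper's released code: a simple classifier is trained on fixed speaker embeddings to predict a nuisance factor, and its held-out accuracy serves as a proxy for how much information about that factor the embeddings encode. The function name, data shapes, and the choice of a logistic-regression probe are illustrative assumptions.

# Minimal sketch of a probing classifier: estimate how much information
# about a nuisance factor (e.g., emotion) a set of fixed speaker
# embeddings encodes. All names and shapes here are illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

def probe_accuracy(embeddings: np.ndarray, factor_labels: np.ndarray,
                   seed: int = 0) -> float:
    """Accuracy of a linear probe predicting a factor from embeddings."""
    X_train, X_test, y_train, y_test = train_test_split(
        embeddings, factor_labels, test_size=0.2, random_state=seed,
        stratify=factor_labels)
    probe = LogisticRegression(max_iter=1000)  # embeddings stay fixed;
    probe.fit(X_train, y_train)                # only the probe is trained
    return accuracy_score(y_test, probe.predict(X_test))

# Comparing entangled vs. disentangled embeddings on the same utterances:
# a drop in probe accuracy for a nuisance factor, with speaker-ID probe
# accuracy roughly preserved, is evidence of successful disentanglement.

Under this protocol, the same probe would be run once per factor (speaker identity, emotion, lexical content, channel, noise condition, etc.) on both the baseline and the disentangled embeddings, so that the per-factor accuracy deltas can be compared directly.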
