Disentangling Style Factors from Speaker Representations

Our goal is to separate speaking style from speaker identity in utterance-level representations of speech such as i-vectors and x-vectors. We first show that both i-vectors and x-vectors contain information not only about speaker identity but also about speaking style (for one data set) or emotion (for another), even when projected into a low-dimensional space. To disentangle these factors, we use an autoencoder whose latent space is split into two subspaces. The entangled speaker and style/emotion information is pushed apart by auxiliary classifiers, each taking one of the two latent subspaces as input and trained jointly with the autoencoder. We evaluate how well the latent subspaces separate the factors by using them as input to separate style/emotion classification tasks. In traditional speaker identification, speaker-invariant characteristics are factorized out from channel effects, and the channel information is then discarded. Our results suggest that this so-called channel may contain exploitable information, which we refer to as style factors. Finally, we propose future work that uses information theory to formalize style factors in the context of speaker identity.
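The abstract does not give architectural details, so the following is a minimal sketch of the split-latent-autoencoder idea rather than the authors' implementation. All particulars here are illustrative assumptions: the 512-dimensional x-vector input, the 32-dimensional speaker and style subspaces, the hidden-layer size, and the loss weights `alpha` and `beta` that trade reconstruction quality against the auxiliary classification objectives.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SplitLatentAutoencoder(nn.Module):
    """Autoencoder whose bottleneck is split into a speaker subspace and a
    style subspace, each supervised by its own auxiliary classifier."""

    def __init__(self, input_dim=512, spk_dim=32, style_dim=32,
                 n_speakers=100, n_styles=4):
        super().__init__()
        self.spk_dim = spk_dim
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 256), nn.ReLU(),
            nn.Linear(256, spk_dim + style_dim),
        )
        self.decoder = nn.Sequential(
            nn.Linear(spk_dim + style_dim, 256), nn.ReLU(),
            nn.Linear(256, input_dim),
        )
        # Each auxiliary classifier reads only one latent subspace, so
        # gradients from the speaker loss shape z_spk and gradients from
        # the style/emotion loss shape z_style.
        self.spk_clf = nn.Linear(spk_dim, n_speakers)
        self.style_clf = nn.Linear(style_dim, n_styles)

    def forward(self, x):
        z = self.encoder(x)
        z_spk, z_style = z[:, :self.spk_dim], z[:, self.spk_dim:]
        return self.decoder(z), self.spk_clf(z_spk), self.style_clf(z_style)

def joint_loss(model, x, spk_labels, style_labels, alpha=1.0, beta=1.0):
    """Reconstruction loss plus the two auxiliary classification losses,
    all optimized jointly as described in the abstract."""
    recon, spk_logits, style_logits = model(x)
    return (F.mse_loss(recon, x)
            + alpha * F.cross_entropy(spk_logits, spk_labels)
            + beta * F.cross_entropy(style_logits, style_labels))
```

After training, disentanglement can be probed in the spirit of the evaluation described above: freeze the encoder, then fit fresh classifiers that take only `z_spk` or only `z_style` as input, and check that style/emotion is predictable from the style subspace but not from the speaker subspace (and vice versa).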
