Mixture Factorized Auto-Encoder for Unsupervised Hierarchical Deep Factorization of Speech Signal

A speech signal is composed of various informative factors, such as linguistic content and speaker characteristics. Notable recent studies have attempted to factorize the speech signal into these individual factors without requiring any annotation. These studies typically assume a continuous representation for linguistic content, which does not accord with general linguistic knowledge and may make the extraction of speaker information less successful. This paper proposes the mixture factorized auto-encoder (mFAE) for unsupervised deep factorization. The encoder part of the mFAE comprises a frame tokenizer and an utterance embedder. The frame tokenizer models the linguistic content of the input speech with a discrete categorical distribution; it performs frame clustering by assigning each frame a soft mixture label. The utterance embedder generates an utterance-level vector representation. A frame decoder reconstructs the speech features from the outputs of the two encoders. The mFAE is evaluated on a speaker verification (SV) task and an unsupervised subword modeling (USM) task. The SV experiments on VoxCeleb 1 show that the utterance embedder is capable of extracting speaker-discriminative embeddings, with performance comparable to an x-vector baseline. The USM experiments on the ZeroSpeech 2017 dataset verify that the frame tokenizer is able to capture linguistic content and that the utterance embedder can acquire speaker-related information.
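The data flow described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the layer sizes, the single linear layers, the mean pooling in the utterance embedder, the concatenation in the decoder, and the mean-squared-error reconstruction loss are all assumptions chosen for brevity, and every name in the code is hypothetical. The sketch only shows the three components and how their outputs connect.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes (not taken from the paper).
FEAT_DIM, NUM_MIX, EMB_DIM = 40, 8, 16

# Randomly initialised weights; a real model would learn these.
params = {
    "W_tok": rng.normal(scale=0.1, size=(FEAT_DIM, NUM_MIX)),
    "W_emb": rng.normal(scale=0.1, size=(FEAT_DIM, EMB_DIM)),
    "W_dec": rng.normal(scale=0.1, size=(NUM_MIX + EMB_DIM, FEAT_DIM)),
}


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def frame_tokenizer(frames, p):
    """Assign each frame a soft mixture label (a categorical posterior)."""
    return softmax(frames @ p["W_tok"], axis=-1)  # shape (T, NUM_MIX)


def utterance_embedder(frames, p):
    """Produce one vector per utterance (assumed: mean pooling + projection)."""
    return np.tanh(frames.mean(axis=0) @ p["W_emb"])  # shape (EMB_DIM,)


def frame_decoder(posteriors, embedding, p):
    """Reconstruct each frame from its soft label and the utterance vector."""
    num_frames = posteriors.shape[0]
    tiled = np.broadcast_to(embedding, (num_frames, embedding.shape[0]))
    return np.concatenate([posteriors, tiled], axis=1) @ p["W_dec"]


def mfae_forward(frames, p):
    """Run tokenizer, embedder and decoder; return outputs and an MSE loss."""
    q = frame_tokenizer(frames, p)
    e = utterance_embedder(frames, p)
    recon = frame_decoder(q, e, p)
    loss = float(np.mean((recon - frames) ** 2))  # assumed objective
    return q, e, recon, loss


# Stand-in for 100 frames of acoustic features from one utterance.
frames = rng.normal(size=(100, FEAT_DIM))
q, e, recon, loss = mfae_forward(frames, params)
```

In this sketch the only utterance-level information available to the decoder is the embedding `e`, while `q` varies frame by frame, which mirrors the intended split between speaker-related and linguistic factors.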
