Towards Learning Fine-Grained Disentangled Representations from Speech

Learning disentangled representations of high-dimensional data is an active research area, but speech processing has received less attention than computer vision. In this paper, we review two representative efforts on disentangled speech representations and propose the novel concept of fine-grained disentangled speech representation learning.
