Fully-Hierarchical Fine-Grained Prosody Modeling for Interpretable Speech Synthesis

This paper proposes a hierarchical, fine-grained, and interpretable latent variable model for prosody based on the Tacotron 2 text-to-speech model. It achieves multi-resolution modeling of prosody by conditioning finer-level representations on coarser-level ones. Additionally, it imposes hierarchical conditioning across all latent dimensions using a conditional variational autoencoder (VAE) with an autoregressive structure. Evaluation of reconstruction performance shows that the new structure does not degrade the model while allowing better interpretability. Interpretations of the prosody attributes are provided, together with a comparison between word-level and phone-level prosody representations. Moreover, both qualitative and quantitative evaluations demonstrate the improvement in the disentanglement of the latent dimensions.
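To make the two mechanisms in the abstract concrete, the sketch below (plain PyTorch, not the authors' implementation) illustrates (a) sampling each latent dimension conditioned on the previously sampled ones, and (b) coarse-to-fine conditioning of phone-level latents on the latent of the parent word. All module names, dimensions, and the index-based phone-to-word mapping are assumptions for illustration; the KL terms, reference encoder, and Tacotron 2 decoder are omitted.

import torch
import torch.nn as nn

class LatentLayer(nn.Module):
    # Samples a latent vector one dimension at a time, conditioning each
    # dimension on the conditioning input plus all previously sampled
    # dimensions (the autoregressive structure described above).
    def __init__(self, cond_dim, z_dim):
        super().__init__()
        # One small predictor per latent dimension: input is the
        # conditioning vector concatenated with z_1 .. z_{i-1};
        # output is (mu_i, logvar_i).
        self.nets = nn.ModuleList(
            nn.Linear(cond_dim + i, 2) for i in range(z_dim)
        )

    def forward(self, cond):
        z, mu, logvar = [], [], []
        for net in self.nets:
            params = net(torch.cat([cond] + z, dim=-1))
            m, lv = params.chunk(2, dim=-1)
            # Reparameterization trick: z_i = mu_i + sigma_i * eps
            z.append(m + torch.exp(0.5 * lv) * torch.randn_like(m))
            mu.append(m)
            logvar.append(lv)
        return torch.cat(z, -1), torch.cat(mu, -1), torch.cat(logvar, -1)

class HierarchicalProsody(nn.Module):
    # Coarse-to-fine conditioning: each phone-level latent is conditioned
    # on the latent of the word that phone belongs to.
    def __init__(self, feat_dim=256, zw_dim=3, zp_dim=3):
        super().__init__()
        self.word_layer = LatentLayer(feat_dim, zw_dim)
        self.phone_layer = LatentLayer(feat_dim + zw_dim, zp_dim)

    def forward(self, word_feat, phone_feat, phone_to_word):
        # word_feat:  [n_words, feat_dim]   aligned word-level features
        # phone_feat: [n_phones, feat_dim]  aligned phone-level features
        # phone_to_word: LongTensor mapping each phone to its parent word
        z_w, _, _ = self.word_layer(word_feat)
        cond = torch.cat([phone_feat, z_w[phone_to_word]], dim=-1)
        z_p, _, _ = self.phone_layer(cond)
        return z_w, z_p

# Hypothetical usage: 4 words, 11 phones, each phone indexed to its word.
model = HierarchicalProsody()
z_w, z_p = model(torch.randn(4, 256), torch.randn(11, 256),
                 torch.tensor([0, 0, 1, 1, 1, 2, 2, 3, 3, 3, 3]))

Because each dimension sees all earlier ones, the chain rule factorization lets individual dimensions specialize (e.g., one tracking F0, another energy), which is what enables the per-dimension interpretability the paper evaluates.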
