Applying the Information Bottleneck Principle to Prosodic Representation Learning

This paper describes a novel neural network-based speech generation model for learning prosodic representations. Representation learning is formulated according to the information bottleneck (IB) principle. A modified VQ-VAE quantization layer is incorporated into the speech generation model to control the IB capacity and to adjust the balance between the reconstruction power and the disentanglement capability of the learned representation. The proposed model learns word-level prosodic representations from speech data. With an optimized IB capacity, the learned representations are not only adequate to reconstruct the original speech but can also be used to transfer prosody onto different textual content. Extensive objective and subjective evaluation results demonstrate the effect of IB capacity control and the effectiveness and potential usage of the learned prosodic representations in controllable neural speech generation.
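
To illustrate how a quantization layer can act as a capacity-limited bottleneck, the following is a minimal sketch of a standard nearest-neighbour vector quantizer, not the modified layer proposed in the paper. The function names, the codebook size, and the vector dimensions are assumptions chosen for illustration. With a codebook of K entries, each quantized vector can carry at most log2(K) bits, so changing K (or the number of codebooks) changes the bottleneck capacity.

```python
import numpy as np

def quantize(z, codebook):
    """Map each row of z to its nearest codebook vector (Euclidean distance)."""
    # Squared distances between every input vector and every code: shape (N, K).
    d = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    idx = d.argmin(axis=1)
    return codebook[idx], idx

def capacity_bits(codebook_size, num_codebooks=1):
    """Upper bound, in bits, on the information carried per quantized vector."""
    return num_codebooks * np.log2(codebook_size)

# Hypothetical sizes: K = 8 codes of dimension 4, and 5 word-level encoder outputs.
rng = np.random.default_rng(0)
codebook = rng.normal(size=(8, 4))
z = rng.normal(size=(5, 4))

z_q, idx = quantize(z, codebook)
print(idx)                # one code index per word-level vector
print(capacity_bits(8))   # capacity upper bound in bits for K = 8
```

In this sketch a smaller codebook forces the representation to discard more detail, which favours disentanglement, while a larger one preserves more of the input, which favours reconstruction.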
