A Variational Prosody Model for Mapping the Context-Sensitive Variation of Functional Prosodic Prototypes

The quest for comprehensive generative models of intonation that link linguistic and paralinguistic functions to prosodic forms has been a longstanding challenge of speech communication research. Traditional intonation models have given way to the overwhelming performance of deep learning (DL) techniques for training general-purpose end-to-end mappings with millions of tunable parameters. The shift towards black-box machine learning models has nonetheless posed the reverse problem: a compelling need to discover knowledge, to explain, visualise and interpret. Our work bridges a comprehensive generative model of intonation and state-of-the-art DL techniques. We build upon the modelling paradigm of the Superposition of Functional Contours (SFC) model and propose a Variational Prosody Model (VPM) that uses a network of variational contour generators to capture the context-sensitive variation of the constituent elementary prosodic contours. We show that the VPM can give insight into the intrinsic variability of these prosodic prototypes by learning a meaningful prosodic latent space representation. We also show that the VPM is able to capture prosodic phenomena with multiple dimensions of context-based variability. Since it is based on the principle of superposition, the VPM does not necessitate specially crafted corpora for the analysis, opening up the possibility of using big data for prosody analysis. In a speech synthesis scenario, the model can be used to generate a dynamic and natural prosody contour that is devoid of averaging effects.
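The core idea, inherited from the SFC, is that the pitch contour of an utterance is the superposition of elementary contours, each emitted by a function-specific generator; in the VPM each generator additionally samples a latent variable that models the context-sensitive variation of its prototype. The following is a minimal toy sketch of that principle, assuming a simple linear encoder/decoder with fixed random weights in place of the trained networks (all class and variable names here are illustrative, not from the paper):

```python
import numpy as np

class VariationalContourGenerator:
    """Toy stand-in for one function-specific variational contour generator.

    The encoder maps a context vector to (mu, log_var) of a latent z;
    the decoder maps [context, z] to an elementary pitch contour.
    Fixed random linear maps replace the trained networks of the VPM.
    """
    def __init__(self, ctx_dim, latent_dim, n_frames, seed):
        rng = np.random.default_rng(seed)
        self.enc_mu = rng.normal(scale=0.1, size=(ctx_dim, latent_dim))
        self.enc_lv = rng.normal(scale=0.1, size=(ctx_dim, latent_dim))
        self.dec = rng.normal(scale=0.1, size=(ctx_dim + latent_dim, n_frames))

    def sample(self, context, rng):
        mu = context @ self.enc_mu
        log_var = context @ self.enc_lv
        # Reparameterisation trick: z = mu + sigma * eps, eps ~ N(0, I)
        z = mu + np.exp(0.5 * log_var) * rng.standard_normal(mu.shape)
        return np.concatenate([context, z]) @ self.dec

rng = np.random.default_rng(1)
# Two hypothetical communicative functions with overlapping scope
gens = {"attitude": VariationalContourGenerator(4, 2, 8, seed=0),
        "syntax": VariationalContourGenerator(4, 2, 8, seed=1)}
contexts = {"attitude": np.ones(4),
            "syntax": np.array([1.0, 0.0, 1.0, 0.0])}

# Superposition principle: the utterance contour is the sum of the
# elementary contours emitted by each active function's generator.
f0 = sum(gens[name].sample(contexts[name], rng) for name in gens)
print(f0.shape)  # → (8,)
```

Because the elementary contours simply add, such a model can in principle be trained on ordinary corpora where functions co-occur, rather than on corpora crafted to isolate one function at a time.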