A Variational Prosody Model for the decomposition and synthesis of speech prosody

The quest for comprehensive generative models of intonation that link linguistic and paralinguistic functions to prosodic forms has been a longstanding challenge in speech communication research. Traditional intonation models have given way to artificial intelligence (AI) techniques that train model-free, end-to-end mappings with millions of tunable parameters and deliver overwhelming performance. This shift towards machine learning has nonetheless posed the reverse problem: a compelling need to discover knowledge, to explain, visualise and interpret. Our work bridges a comprehensive generative model of intonation and state-of-the-art AI techniques. We build upon the modelling paradigm of the Superposition of Functional Contours model and propose a Variational Prosody Model (VPM) that uses a network of deep variational contour generators to capture the context-sensitive variation of the constituent elementary prosodic clichés. We show that the VPM gives insight into the intrinsic variability of these prosodic prototypes by learning a meaningful, structured prosodic latent space. We also show that the VPM improves modelling performance, especially when such variability is prominent. In a speech synthesis scenario, we believe the model can be used to generate dynamic and natural prosody contours largely devoid of averaging effects.
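
To make the idea of superposed variational contour generators concrete, the following is a minimal sketch and not the model described in this paper. It assumes that each communicative function has a scope over a span of syllables, that a small generator maps position features plus a latent code to one pitch value per syllable, and that the utterance contour is the sum of the overlapping elementary contours. All names (`generate_contour`, `superpose`, `random_params`), the network sizes, the position features and the fixed `mu` and `log_var` values are hypothetical choices made for illustration. The latent code is drawn with the standard reparameterisation trick, and the KL term is the usual closed form against a standard normal prior.

```python
import numpy as np


def sample_latent(mu, log_var, rng):
    """Reparameterisation trick: z = mu + sigma * eps, with eps ~ N(0, I)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps


def kl_to_standard_normal(mu, log_var):
    """KL(N(mu, sigma^2) || N(0, I)), summed over latent dimensions."""
    return 0.5 * float(np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var))


def generate_contour(n_syllables, z, params):
    """Toy contour generator (assumed architecture): one pitch value per syllable."""
    # Hypothetical position features: distance from scope start and to scope end.
    ramp = np.linspace(0.0, 1.0, n_syllables)
    positions = np.stack([ramp, 1.0 - ramp], axis=1)            # (n, 2)
    hidden = np.tanh(positions @ params["w_pos"] + z @ params["w_z"])  # (n, H)
    return hidden @ params["w_out"]                              # (n,)


def superpose(utterance_len, functions, rng):
    """Sum the elementary contours over their scopes; accumulate the KL terms."""
    total = np.zeros(utterance_len)
    kl = 0.0
    for f in functions:
        start, end = f["scope"]
        z = sample_latent(f["mu"], f["log_var"], rng)
        total[start:end] += generate_contour(end - start, z, f["params"])
        kl += kl_to_standard_normal(f["mu"], f["log_var"])
    return total, kl


def random_params(rng, latent_dim=2, hidden=8):
    """Untrained random weights, for illustration only."""
    return {
        "w_pos": rng.standard_normal((2, hidden)),
        "w_z": rng.standard_normal((latent_dim, hidden)),
        "w_out": rng.standard_normal(hidden),
    }


rng = np.random.default_rng(0)
functions = [
    # A function spanning syllables 0..5, with its latent at the prior.
    {"scope": (0, 6), "mu": np.zeros(2), "log_var": np.zeros(2),
     "params": random_params(rng)},
    # A second, overlapping function with a narrower, shifted latent.
    {"scope": (2, 5), "mu": np.array([0.5, -0.5]),
     "log_var": np.array([-1.0, -1.0]), "params": random_params(rng)},
]
contour, kl = superpose(8, functions, rng)
```

Calling `superpose` repeatedly with the same `functions` yields different contours because a new latent code is sampled each time, which is the property that would let such a model avoid always producing an averaged contour. In a trained model the `mu` and `log_var` values would come from a learned network rather than being fixed by hand.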
