Paper information - A Variational Prosody Model for the decomposition and synthesis of speech prosody

The quest for comprehensive generative models of intonation that link linguistic and paralinguistic functions to prosodic forms has been a longstanding challenge in speech communication research. Traditional intonation models have given way to artificial intelligence (AI) techniques, whose strong performance comes from training model-free, end-to-end mappings with millions of tunable parameters. This shift towards machine learning has nonetheless posed the reverse problem: a compelling need to discover knowledge, and to explain, visualise and interpret what the models learn. Our work bridges a comprehensive generative model of intonation and state-of-the-art AI techniques. We build upon the modelling paradigm of the Superposition of Functional Contours (SFC) model and propose a Variational Prosody Model (VPM) that uses a network of deep variational contour generators to capture the context-sensitive variation of the constituent elementary prosodic clichés. We show that the VPM gives insight into the intrinsic variability of these prosodic prototypes by learning a meaningfully structured prosodic latent space. We also show that the VPM improves modelling performance, especially when such variability is prominent. In a speech synthesis scenario, we believe the model can be used to generate dynamic and natural prosody contours largely devoid of averaging effects.
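The two ideas the abstract combines, superposing overlapping functional contours and sampling each contour from a latent variable, can be illustrated with a toy sketch. This is not the authors' architecture: the linear `decode_contour`, the 2-D latent, the per-syllable offsets and all function names are hypothetical choices made for illustration, whereas the paper describes deep, context-conditioned contour generators.

```python
import math
import random


def sample_latent(mu, log_var, rng):
    """Reparameterisation trick: z = mu + sigma * eps, with eps ~ N(0, 1)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, sigma^2) || N(0, 1) ), summed over latent dimensions."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv
                     for m, lv in zip(mu, log_var))


def decode_contour(z, n_units):
    """Hypothetical toy decoder (assumption, not from the paper).

    z[0] sets the overall offset of the contour and z[1] its slope
    across the scope; one value is produced per unit in the scope.
    """
    contour = []
    for i in range(n_units):
        pos = i / max(n_units - 1, 1)  # relative position in scope, 0..1
        contour.append(z[0] + z[1] * (pos - 0.5))
    return contour


def superpose(n_syllables, contours):
    """Sum overlapping contours; each contour is (start_index, values)."""
    total = [0.0] * n_syllables
    for start, values in contours:
        for i, v in enumerate(values):
            total[start + i] += v
    return total


if __name__ == "__main__":
    rng = random.Random(0)
    # One contour spanning the whole 6-syllable utterance,
    # and one spanning only syllables 2..3 (both made up).
    z_utt = sample_latent([0.0, -1.0], [-2.0, -2.0], rng)
    z_emph = sample_latent([1.0, 0.0], [-2.0, -2.0], rng)
    f0 = superpose(6, [(0, decode_contour(z_utt, 6)),
                       (2, decode_contour(z_emph, 2))])
    print([round(v, 2) for v in f0])
```

Resampling `z_utt` and `z_emph` yields a different realisation of each contour on every call, which is the intuition behind avoiding averaged-out prosody; in a trained variational model the KL term would regularise the latent space during learning.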
