CAMP: A Two-Stage Approach to Modelling Prosody in Context

Prosody is an integral part of communication, but remains an open problem in state-of-the-art speech synthesis. There are two major issues faced when modelling prosody: (1) prosody varies at a slower rate than other content in the acoustic signal (e.g. segmental information and background noise); (2) determining appropriate prosody without sufficient context is an ill-posed problem. In this paper, we propose solutions to both of these issues. To mitigate the challenge of modelling a slowly varying signal, we learn to disentangle prosodic information using a word-level representation. To alleviate the ill-posed nature of prosody modelling, we use syntactic and semantic information derived from text to learn a context-dependent prior over our prosodic space. Our context-aware model of prosody (CAMP) outperforms the state-of-the-art technique, closing the gap with natural speech by 26%. We also find that replacing attention with a jointly trained duration model improves prosody significantly.
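The two-stage idea described above can be illustrated with a minimal numpy sketch: stage 1 learns a compact word-level prosody code from acoustic features, and stage 2 fits a context-dependent prior that predicts that code from text-derived features alone. PCA and least squares stand in here for the paper's learned encoder and prior network; all dimensions, variable names, and the random toy data are purely illustrative assumptions, not the actual CAMP implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions: per-word acoustic summaries and text-context features (illustrative).
N_WORDS, ACOUSTIC_DIM, CONTEXT_DIM, PROSODY_DIM = 200, 16, 8, 4

acoustic = rng.normal(size=(N_WORDS, ACOUSTIC_DIM))  # word-level acoustic summaries
context = rng.normal(size=(N_WORDS, CONTEXT_DIM))    # syntactic/semantic text features

# Stage 1: learn a word-level prosody code by linear autoencoding.
# (PCA via SVD stands in for the paper's learned prosody encoder.)
centred = acoustic - acoustic.mean(axis=0)
_, _, vt = np.linalg.svd(centred, full_matrices=False)
encoder = vt[:PROSODY_DIM].T          # maps acoustics -> prosody code
codes = centred @ encoder             # shape (N_WORDS, PROSODY_DIM)

# Stage 2: fit a context-dependent prior predicting the code from text
# features alone. (Least squares stands in for the learned prior network.)
prior_w, *_ = np.linalg.lstsq(context, codes, rcond=None)

# At synthesis time only text is available: predict prosody codes from context.
predicted_codes = context @ prior_w
print(predicted_codes.shape)  # (200, 4)
```

At inference, the predicted codes would condition the synthesiser in place of codes extracted from reference audio, which is what makes the prior "context-dependent" rather than a single global average.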
