CAMP: A Two-Stage Approach to Modelling Prosody in Context

Prosody is an integral part of communication, but remains an open problem in state-of-the-art speech synthesis. There are two major issues faced when modelling prosody: (1) prosody varies at a slower rate than other content in the acoustic signal (e.g. segmental information and background noise); (2) determining appropriate prosody without sufficient context is an ill-posed problem. In this paper, we propose solutions to both of these issues. To mitigate the challenge of modelling a slowly varying signal, we learn to disentangle prosodic information using a word-level representation. To alleviate the ill-posed nature of prosody modelling, we use syntactic and semantic information derived from text to learn a context-dependent prior over our prosodic space. Our context-aware model of prosody (CAMP) outperforms the state-of-the-art technique, closing the gap with natural speech by 26%. We also find that replacing attention with a jointly trained duration model improves prosody significantly.
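The two-stage idea described above can be illustrated with a minimal numpy sketch: stage 1 learns a compact word-level prosody code from acoustic features, and stage 2 fits a context-dependent prior that predicts that code from text-derived features alone. PCA and least squares stand in here for the paper's learned encoder and prior network; all dimensions, variable names, and the random toy data are purely illustrative assumptions, not the actual CAMP implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions: per-word acoustic summaries and text-context features (illustrative).
N_WORDS, ACOUSTIC_DIM, CONTEXT_DIM, PROSODY_DIM = 200, 16, 8, 4

acoustic = rng.normal(size=(N_WORDS, ACOUSTIC_DIM))  # word-level acoustic summaries
context = rng.normal(size=(N_WORDS, CONTEXT_DIM))    # syntactic/semantic text features

# Stage 1: learn a word-level prosody code by linear autoencoding.
# (PCA via SVD stands in for the paper's learned prosody encoder.)
centred = acoustic - acoustic.mean(axis=0)
_, _, vt = np.linalg.svd(centred, full_matrices=False)
encoder = vt[:PROSODY_DIM].T          # maps acoustics -> prosody code
codes = centred @ encoder             # shape (N_WORDS, PROSODY_DIM)

# Stage 2: fit a context-dependent prior predicting the code from text
# features alone. (Least squares stands in for the learned prior network.)
prior_w, *_ = np.linalg.lstsq(context, codes, rcond=None)

# At synthesis time only text is available: predict prosody codes from context.
predicted_codes = context @ prior_w
print(predicted_codes.shape)  # (200, 4)
```

At inference, the predicted codes would condition the synthesiser in place of codes extracted from reference audio, which is what makes the prior "context-dependent" rather than a single global average.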
