In Other News: A Bi-Style Text-to-Speech Model for Synthesizing Newscaster Voice with Limited Data

Neural text-to-speech synthesis (NTTS) models have shown significant progress in generating high-quality speech; however, they require a large quantity of training data, which makes building models for multiple styles expensive and time-consuming. In this paper, we analyse different styles of speech in terms of their prosodic variations and, building on this analysis, propose a model that synthesises speech in the style of a newscaster with just a few hours of supplementary data. We pose the problem of synthesising in a target style with limited data as that of creating a bi-style model that can synthesise both neutral-style and newscaster-style speech via a one-hot vector that factorises the two styles. We also propose conditioning the model on contextual word embeddings, and we evaluate it extensively against neutral NTTS and neutral concatenative synthesis. The proposed model closes the gap in perceived style-appropriateness between natural newscaster-style recordings and neutral speech synthesis by approximately two-thirds.
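
To make the conditioning scheme concrete, below is a minimal sketch (in PyTorch) of the bi-style idea described above: a one-hot style flag (neutral vs. newscaster) is embedded and fused with the phoneme encoder outputs and pre-aligned contextual word embeddings before decoding. The module names, dimensions, and the BERT-sized word-embedding width are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

NUM_STYLES = 2          # 0 = neutral, 1 = newscaster
STYLE_DIM = 16          # assumed size of the learned style embedding
WORD_EMB_DIM = 768      # e.g. BERT-base hidden size (assumption)
ENC_DIM = 256           # assumed phoneme-encoder output size

class BiStyleConditioner(nn.Module):
    """Fuses phoneme encodings, contextual word embeddings, and a style flag."""
    def __init__(self):
        super().__init__()
        self.style_emb = nn.Embedding(NUM_STYLES, STYLE_DIM)
        # Project the fused features back to the decoder's expected width.
        self.proj = nn.Linear(ENC_DIM + WORD_EMB_DIM + STYLE_DIM, ENC_DIM)

    def forward(self, phoneme_enc, word_emb, style_id):
        # phoneme_enc: (batch, T, ENC_DIM)
        # word_emb:    (batch, T, WORD_EMB_DIM), word embeddings upsampled
        #              to the phoneme sequence length beforehand
        # style_id:    (batch,) long tensor of 0 or 1
        T = phoneme_enc.size(1)
        # Broadcast the style embedding across all encoder time steps.
        style = self.style_emb(style_id).unsqueeze(1).expand(-1, T, -1)
        fused = torch.cat([phoneme_enc, word_emb, style], dim=-1)
        return self.proj(fused)  # fed to the attention-based decoder

# Usage sketch with dummy tensors:
cond = BiStyleConditioner()
phonemes = torch.randn(4, 120, ENC_DIM)
words = torch.randn(4, 120, WORD_EMB_DIM)
style = torch.ones(4, dtype=torch.long)   # request newscaster style
decoder_input = cond(phonemes, words, style)
```

Because both styles share all parameters except the small style embedding, a few hours of newscaster data can, in principle, adapt the shared model rather than train a style-specific one from scratch.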
