Investigation of learning abilities on linguistic features in sequence-to-sequence text-to-speech synthesis

Neural sequence-to-sequence text-to-speech synthesis (TTS) can produce high-quality speech directly from text or simple linguistic features such as phonemes. Unlike traditional pipeline TTS, neural sequence-to-sequence TTS does not require manually annotated, complicated linguistic features such as part-of-speech tags and syntactic structures for system training. However, it must be carefully designed and well optimized so that it can implicitly extract useful linguistic features from its inputs. In this paper we investigate under what conditions neural sequence-to-sequence TTS works well in Japanese and English, and compare it with deep neural network (DNN)-based pipeline TTS systems. Unlike past comparative studies, our pipeline systems also use autoregressive probabilistic modeling and a neural vocoder. We investigate the systems from three aspects: a) model architecture, b) model parameter size, and c) language. For the model architecture aspect, we adopt modified Tacotron systems that we previously proposed, together with variants using an encoder from Tacotron or Tacotron2. For the model parameter size aspect, we compare two model sizes. For the language aspect, we conduct listening tests in both Japanese and English to see whether our findings generalize across languages. Our experiments suggest that a) a neural sequence-to-sequence TTS system should have a sufficient number of model parameters to produce high-quality speech, b) it should also use a powerful encoder when it takes characters as inputs, and c) the encoder still has room for improvement and needs a better architecture to learn supra-segmental features more appropriately.
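To make the encoder comparison concrete, the sketch below shows a minimal PyTorch implementation of a Tacotron2-style encoder (a convolution stack followed by a bidirectional LSTM), the stronger of the two encoder families discussed above; the original Tacotron encoder instead uses a pre-net plus CBHG module (convolution bank, highway layers, and a bidirectional GRU). All names, module sizes, and hyperparameters here are illustrative assumptions for exposition, not the authors' exact configuration.

```python
# Minimal sketch of a Tacotron2-style encoder (conv stack + BiLSTM).
# Hyperparameters (embedding size, channel counts, dropout) are
# illustrative assumptions, not the paper's exact settings.
import torch
import torch.nn as nn


class Tacotron2StyleEncoder(nn.Module):
    """Embeds input symbols (characters or phonemes), applies a stack of
    1-D convolutions, then a bidirectional LSTM to model context."""

    def __init__(self, n_symbols: int, dim: int = 512, n_convs: int = 3):
        super().__init__()
        self.embedding = nn.Embedding(n_symbols, dim)
        self.convs = nn.ModuleList(
            nn.Sequential(
                nn.Conv1d(dim, dim, kernel_size=5, padding=2),
                nn.BatchNorm1d(dim),
                nn.ReLU(),
                nn.Dropout(0.5),
            )
            for _ in range(n_convs)
        )
        # dim // 2 hidden units per direction -> dim-sized outputs overall.
        self.lstm = nn.LSTM(dim, dim // 2, batch_first=True, bidirectional=True)

    def forward(self, symbol_ids: torch.Tensor) -> torch.Tensor:
        x = self.embedding(symbol_ids).transpose(1, 2)  # (B, dim, T)
        for conv in self.convs:
            x = conv(x)
        x = x.transpose(1, 2)                           # (B, T, dim)
        outputs, _ = self.lstm(x)
        return outputs  # (B, T, dim) encoder states fed to attention


# Usage example: encode a batch of two symbol sequences of length 20.
encoder = Tacotron2StyleEncoder(n_symbols=100)
ids = torch.randint(0, 100, (2, 20))
print(encoder(ids).shape)  # torch.Size([2, 20, 512])
```

The design intuition matching the finding in b) above: character inputs carry less explicit linguistic information than phonemes, so a wider-context encoder (convolutions plus a recurrent layer) has more capacity to recover pronunciation and prosody-related features implicitly.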
