Tacotron-Based Acoustic Model Using Phoneme Alignment for Practical Neural Text-to-Speech Systems

Although sequence-to-sequence (seq2seq) models with attention mechanisms in neural text-to-speech (TTS) systems, such as Tacotron 2, can jointly optimize duration and acoustic models and achieve higher-fidelity synthesis than conventional duration-acoustic pipeline models, they carry the risk that speech samples sometimes cannot be synthesized successfully because of attention prediction errors. Such seq2seq models therefore cannot be introduced directly into practical TTS systems. Conventional pipeline models, by contrast, are widely used in practical TTS systems because their duration models rarely make critical prediction errors. To realize high-quality practical TTS systems free of attention prediction errors, this paper investigates Tacotron-based acoustic models that use phoneme alignment instead of attention. The phoneme durations are first obtained from HMM-based forced alignment, and the duration model is a simple bidirectional LSTM-based network. A seq2seq model that uses forced alignment instead of attention is then investigated, and an alternative model combining a Tacotron decoder with phoneme durations is proposed. Experimental results with full-context label input and a WaveGlow vocoder indicate that, unlike the seq2seq models, the proposed model can realize a high-fidelity Japanese TTS system with no attention prediction errors and a real-time factor of 0.13 on a GPU.
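The core change the abstract describes, replacing the attention module with explicit phoneme durations, amounts to upsampling the phoneme-level encoder outputs to the frame level before the Tacotron decoder consumes them, with a bidirectional LSTM predicting one duration per phoneme. The sketch below is a minimal illustration of that idea under assumed details (PyTorch, module names, and dimensions are all assumptions), not the authors' implementation:

```python
import torch
import torch.nn as nn

def upsample_by_duration(encoder_outputs, durations):
    """Repeat each phoneme-level encoder frame by its duration
    (in decoder frames), giving a frame-level conditioning sequence
    the decoder can consume in place of attention context vectors.

    encoder_outputs: (num_phonemes, dim) float tensor
    durations:       (num_phonemes,) integer tensor
    """
    return torch.repeat_interleave(encoder_outputs, durations, dim=0)

class DurationModel(nn.Module):
    """A simple bidirectional-LSTM duration model, as the abstract
    describes; layer sizes here are illustrative assumptions."""
    def __init__(self, in_dim, hidden=256):
        super().__init__()
        self.blstm = nn.LSTM(in_dim, hidden,
                             batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden, 1)  # one duration per phoneme

    def forward(self, phoneme_features):
        # phoneme_features: (batch, num_phonemes, in_dim)
        h, _ = self.blstm(phoneme_features)
        return self.proj(h).squeeze(-1)  # (batch, num_phonemes)

# Toy usage: 5 phonemes with durations from forced alignment.
enc = torch.randn(5, 512)            # phoneme-level encoder outputs
dur = torch.tensor([3, 1, 4, 2, 5])  # frames per phoneme
cond = upsample_by_duration(enc, dur)  # -> (15, 512) frame-level input
```

At training time the durations come from HMM-based forced alignment; at synthesis time they would come from the duration model (rounded to integer frame counts), so the decoder never has to learn an attention alignment.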
