Transformer-Based Text-to-Speech with Weighted Forced Attention
Tomoki Toda | Hisashi Kawai | Yoshinori Shiga | Takuma Okamoto