ESPnet-TTS: Unified, Reproducible, and Integratable Open Source End-to-End Text-to-Speech Toolkit
Tomoki Hayashi, Ryuichi Yamamoto, Katsuki Inoue, Takenori Yoshimura, Shinji Watanabe, Tomoki Toda, Kazuya Takeda, Yu Zhang, Xu Tan