NaturalSpeech 2: Latent Diffusion Models are Natural and Zero-Shot Speech and Singing Synthesizers
Kai Shen | Zeqian Ju | Xu Tan | Yanqing Liu | Yichong Leng | Lei He | Tao Qin | Sheng Zhao | Jiang Bian
[1] Linquan Liu, et al. FoundationTTS: Text-to-Speech for ASR Customization with Generative Language Model, 2023, arXiv.
[2] Damien Vincent, et al. Speak, Read and Prompt: High-Fidelity Text-to-Speech with Minimal Supervision, 2023, Transactions of the Association for Computational Linguistics.
[3] Y. Bengio, et al. Regeneration Learning: A Learning Paradigm for Data Generation, 2023, AAAI.
[4] Jinyu Li, et al. Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers, 2023, arXiv.
[5] Gabriel Synnaeve, et al. High Fidelity Neural Audio Compression, 2022, arXiv.
[6] Yaniv Taigman, et al. AudioGen: Textually Guided Audio Generation, 2022, ICLR.
[7] David Grangier, et al. AudioLM: A Language Modeling Approach to Audio Generation, 2022, IEEE/ACM Transactions on Audio, Speech, and Language Processing.
[8] Xu Tan, et al. DelightfulTTS 2: End-to-End Speech Synthesis with Adversarial Vector-Quantized Auto-Encoders, 2022, INTERSPEECH.
[9] Zhiwei Xiong, et al. RetrieverTTS: Modeling Decomposed Factors for Text-Based Speech Insertion, 2022, INTERSPEECH.
[10] Tie-Yan Liu, et al. NaturalSpeech: End-to-End Text-to-Speech Synthesis with Human-Level Quality, 2022, IEEE Transactions on Pattern Analysis and Machine Intelligence.
[11] Arnaldo Cândido Júnior, et al. YourTTS: Towards Zero-Shot Multi-Speaker TTS and Zero-Shot Voice Conversion for everyone, 2021, ICML.
[12] Jinyu Li, et al. WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing, 2021, IEEE Journal of Selected Topics in Signal Processing.
[13] Zhou Zhao, et al. DiffSinger: Singing Voice Synthesis via Shallow Diffusion Mechanism, 2021, AAAI.
[14] João Paulo Teixeira, et al. TTS-Portuguese Corpus: A Corpus for Speech Synthesis in Brazilian Portuguese, 2020, Language Resources and Evaluation.
[15] Marco Tagliasacchi, et al. SoundStream: An End-to-End Neural Audio Codec, 2022, IEEE/ACM Transactions on Audio, Speech, and Language Processing.
[16] Lei He, et al. DelightfulTTS: The Microsoft Speech Synthesis System for Blizzard Challenge 2021, 2021, arXiv.
[17] Tao Qin, et al. A Survey on Neural Speech Synthesis, 2021, arXiv.
[18] Ruslan Salakhutdinov, et al. HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units, 2021, IEEE/ACM Transactions on Audio, Speech, and Language Processing.
[19] Jungil Kong, et al. Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech, 2021, ICML.
[20] Tasnima Sadekova, et al. Grad-TTS: A Diffusion Probabilistic Model for Text-to-Speech, 2021, ICML.
[21] Nam Soo Kim, et al. Diff-TTS: A Denoising Diffusion Model for Text-to-Speech, 2021, INTERSPEECH.
[22] B. Ommer, et al. Taming Transformers for High-Resolution Image Synthesis, 2021, CVPR.
[23] Abhishek Kumar, et al. Score-Based Generative Modeling through Stochastic Differential Equations, 2020, ICLR.
[24] Bryan Catanzaro, et al. DiffWave: A Versatile Diffusion Model for Audio Synthesis, 2020, ICLR.
[25] Heiga Zen, et al. WaveGrad: Estimating Gradients for Waveform Generation, 2020, ICLR.
[26] Tie-Yan Liu, et al. FastSpeech 2: Fast and High-Quality End-to-End Text to Speech, 2020, ICLR.
[27] Gabriel Synnaeve, et al. MLS: A Large-Scale Multilingual Dataset for Speech Research, 2020, INTERSPEECH.
[28] Mark Chen, et al. Language Models are Few-Shot Learners, 2020, NeurIPS.
[29] Abdel-rahman Mohamed, et al. Libri-Light: A Benchmark for ASR with Limited or No Supervision, 2020, ICASSP.
[30] Ali Razavi, et al. Generating Diverse High-Fidelity Images with VQ-VAE-2, 2019, NeurIPS.
[31] Xu Tan, et al. FastSpeech: Fast, Robust and Controllable Text to Speech, 2019, NeurIPS.
[32] Xu Tan, et al. Token-Level Ensemble Distillation for Grapheme-to-Phoneme Conversion, 2019, INTERSPEECH.
[33] Heiga Zen, et al. LibriTTS: A Corpus Derived from LibriSpeech for Text-to-Speech, 2019, INTERSPEECH.
[34] Jan Skoglund, et al. LPCNet: Improving Neural Speech Synthesis through Linear Prediction, 2019, ICASSP.
[35] Shujie Liu, et al. Neural Speech Synthesis with Transformer Network, 2018, AAAI.
[36] Ilya Sutskever, et al. Language Models are Unsupervised Multitask Learners, 2019.
[37] Yuxuan Wang, et al. Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis, 2018, ICML.
[38] Navdeep Jaitly, et al. Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions, 2018, ICASSP.
[39] Aaron C. Courville, et al. FiLM: Visual Reasoning with a General Conditioning Layer, 2017, AAAI.
[40] Alec Radford, et al. Improving Language Understanding by Generative Pre-Training, 2018.
[41] Oriol Vinyals, et al. Neural Discrete Representation Learning, 2017, NeurIPS.
[42] Lukasz Kaiser, et al. Attention Is All You Need, 2017, NeurIPS.
[43] Samy Bengio, et al. Tacotron: Towards End-to-End Speech Synthesis, 2017, INTERSPEECH.
[44] Li Zhao, et al. Attention-based LSTM for Aspect-level Sentiment Classification, 2016, EMNLP.
[45] Junichi Yamagishi, et al. CSTR VCTK Corpus: English Multi-speaker Corpus for CSTR Voice Cloning Toolkit, 2016.
[46] Heiga Zen, et al. WaveNet: A Generative Model for Raw Audio, 2016, SSW.
[47] Sanjeev Khudanpur, et al. Librispeech: An ASR Corpus Based on Public Domain Audio Books, 2015, ICASSP.
[48] Joseph P. Olive, et al. Text-to-Speech Synthesis, 1995, AT&T Technical Journal.