Hierarchical Generative Modeling for Controllable Speech Synthesis
Ron J. Weiss | Patrick Nguyen | Zhifeng Chen | Yonghui Wu | Heiga Zen | Yuan Cao | Yuxuan Wang | Yu Zhang | Ruoming Pang | Jonathan Shen | Wei-Ning Hsu | Ye Jia
[1] S. Boll, et al. Suppression of acoustic noise in speech using spectral subtraction, 1979.
[2] Murray Shanahan, et al. Deep Unsupervised Clustering with Gaussian Mixture Variational Autoencoders, 2016, ArXiv.
[3] Navdeep Jaitly, et al. Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions, 2017, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[4] Satoshi Nakamura, et al. Listening while speaking: Speech chain by deep learning, 2017, 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU).
[5] Yuxuan Wang, et al. Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis, 2018, ICML.
[6] Lukasz Kaiser, et al. Attention Is All You Need, 2017, NIPS.
[7] Hao Tang, et al. Unsupervised Adaptation with Interpretable Disentangled Representations for Distant Conversational Speech Recognition, 2018, INTERSPEECH.
[8] Sanjeev Khudanpur, et al. Librispeech: An ASR corpus based on public domain audio books, 2015, 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[9] Hideki Kawahara, et al. YIN, a fundamental frequency estimator for speech and music, 2002, The Journal of the Acoustical Society of America.
[10] Yuxuan Wang, et al. Towards End-to-End Prosody Transfer for Expressive Speech Synthesis with Tacotron, 2018, ICML.
[11] Tomoki Toda, et al. Back-Translation-Style Data Augmentation for End-to-End ASR, 2018, 2018 IEEE Spoken Language Technology Workshop (SLT).
[12] Tara N. Sainath, et al. Generation of Large-Scale Simulated Utterances in Virtual Rooms to Train Deep-Neural Networks for Far-Field Speech Recognition in Google Home, 2017, INTERSPEECH.
[13] Richard M. Stern, et al. Robust signal-to-noise ratio estimation based on waveform amplitude distribution analysis, 2008, INTERSPEECH.
[14] Sercan Ömer Arik, et al. Deep Voice 2: Multi-Speaker Neural Text-to-Speech, 2017, NIPS.
[15] Max Welling, et al. Auto-Encoding Variational Bayes, 2013, ICLR.
[16] Adam Coates, et al. Deep Voice: Real-time Neural Text-to-Speech, 2017, ICML.
[17] Xin Wang, et al. Deep Encoder-Decoder Models for Unsupervised Learning of Controllable Speech Synthesis, 2018, ArXiv.
[18] Lior Wolf, et al. Fitting New Speakers Based on a Short Untranscribed Sample, 2018, ICML.
[19] Yutaka Matsuo, et al. Expressive Speech Synthesis via Modeling Expressions with Variational Autoencoder, 2018, INTERSPEECH.
[20] Zhiting Hu, et al. Improved Variational Autoencoders for Text Modeling using Dilated Convolutions, 2017, ICML.
[21] Lior Wolf, et al. VoiceLoop: Voice Fitting and Synthesis via a Phonological Loop, 2017, ICLR.
[22] Sercan Ömer Arik, et al. Deep Voice 3: 2000-Speaker Neural Text-to-Speech, 2017, ICLR.
[23] Satoshi Nakamura, et al. Machine Speech Chain with One-shot Speaker Adaptation, 2018, INTERSPEECH.
[24] Patrick Nguyen, et al. Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis, 2018, NeurIPS.
[25] Yoshua Bengio, et al. Char2Wav: End-to-End Speech Synthesis, 2017, ICLR.
[26] Heiga Zen, et al. Statistical Parametric Speech Synthesis, 2007, IEEE International Conference on Acoustics, Speech, and Signal Processing.
[27] Samy Bengio, et al. Tacotron: Towards End-to-End Speech Synthesis, 2017, INTERSPEECH.
[28] Jimmy Ba, et al. Adam: A Method for Stochastic Optimization, 2014, ICLR.
[29] Christopher Burgess, et al. beta-VAE: Learning Basic Visual Concepts with a Constrained Variational Framework, 2016, ICLR.
[30] Yoshua Bengio, et al. Attention-Based Models for Speech Recognition, 2015, NIPS.
[31] S. King, et al. The Blizzard Challenge 2013, 2013.
[32] Yoshua Bengio, et al. Understanding the difficulty of training deep feedforward neural networks, 2010, AISTATS.
[33] Yu Zhang, et al. Unsupervised domain adaptation for robust speech recognition via variational autoencoder-based data augmentation, 2017, 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU).
[34] Yu Zhang, et al. Unsupervised Learning of Disentangled and Interpretable Representations from Sequential Data, 2017, NIPS.
[35] Sercan Ömer Arik, et al. Neural Voice Cloning with a Few Samples, 2018, NeurIPS.
[36] Alexander M. Rush, et al. Semi-Amortized Variational Autoencoders, 2018, ICML.
[37] Quoc V. Le, et al. Sequence to Sequence Learning with Neural Networks, 2014, NIPS.
[38] John W. Black, et al. Relationships among Fundamental Frequency, Vocal Sound Pressure, and Rate of Speaking, 1961.
[39] Samy Bengio, et al. Generating Sentences from a Continuous Space, 2015, CoNLL.
[40] Heiga Zen, et al. WaveNet: A Generative Model for Raw Audio, 2016, SSW.
[41] L. Streeter, et al. Effects of Pitch and Speech Rate on Personal Attributions, 1979.
[42] Erich Elsen, et al. Efficient Neural Audio Synthesis, 2018, ICML.
[43] Lars Hertel, et al. Approximate Inference for Deep Latent Gaussian Mixtures, 2016.
[44] Huachun Tan, et al. Variational Deep Embedding: An Unsupervised and Generative Approach to Clustering, 2016, IJCAI.
[45] Yoshua Bengio, et al. Neural Machine Translation by Jointly Learning to Align and Translate, 2014, ICLR.