Paper Information - FFTNet: A Real-Time Speaker-Dependent Neural Vocoder

We introduce FFTNet, a deep learning approach to synthesizing audio waveforms. Our approach builds on the recent WaveNet project, which showed that it was possible to synthesize a natural-sounding audio waveform directly from a deep convolutional neural network. FFTNet offers two improvements over WaveNet. First, it is substantially faster, allowing for real-time synthesis of audio waveforms. Second, when used as a vocoder, it produces speech that sounds more natural, as measured via a “mean opinion score” test.
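For readers unfamiliar with the metric, a mean opinion score is conventionally the arithmetic mean of listener ratings on a 1 to 5 scale. The sketch below shows that arithmetic only; the function name `mean_opinion_score`, the sample ratings, and the normal-approximation 95% interval are illustrative assumptions, not the evaluation protocol used for FFTNet.

```python
from math import sqrt
from statistics import mean, stdev


def mean_opinion_score(ratings):
    """Return (MOS, 95% CI half-width) for listener ratings on a 1-5 scale.

    Hypothetical helper: the interval uses a normal approximation (z = 1.96),
    which is an assumption made here for illustration.
    """
    if len(ratings) < 2:
        raise ValueError("need at least two ratings")
    if any(not 1 <= r <= 5 for r in ratings):
        raise ValueError("ratings must lie between 1 and 5")
    mos = mean(ratings)
    # Standard error of the mean, scaled to an approximate 95% interval.
    half_width = 1.96 * stdev(ratings) / sqrt(len(ratings))
    return mos, half_width


# Made-up ratings for one synthesized utterance, not data from the paper.
example_ratings = [4, 5, 3, 4, 4]
mos, half_width = mean_opinion_score(example_ratings)
print(f"MOS = {mos:.2f} +/- {half_width:.2f}")
```

Comparing two vocoders this way amounts to computing the score for each system over the same set of utterances and checking whether the intervals overlap.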
