FFTNet: A Real-Time Speaker-Dependent Neural Vocoder

We introduce FFTNet, a deep learning approach for synthesizing audio waveforms. Our approach builds on the recent WaveNet project, which showed that a natural-sounding audio waveform can be synthesized directly by a deep convolutional neural network. FFTNet offers two improvements over WaveNet. First, it is substantially faster, allowing real-time synthesis of audio waveforms. Second, when used as a vocoder, it produces speech that sounds more natural, as measured by a “mean opinion score” test.
