Generative approach using noise generation models for DNN-based speech synthesis trained from noisy speech

This paper proposes a generative approach to building high-quality speech synthesis from noisy speech. Constructing high-quality speech synthesis requires studio-quality recordings, but most existing speech data have been recorded in noisy environments. A common way to use noisy speech for training speech synthesis models is to reduce the noise before vocoder-based parameterization; however, such multi-step processing accumulates spectral distortion. Meanwhile, vocoder-free statistical parametric speech synthesis (SPSS), which directly generates spectral parameters or waveforms, has recently been proposed. Vocoder-free SPSS enables us to train speech synthesis models while explicitly modeling the additive-noise process commonly assumed in signal processing research. In the proposed approach, newly introduced noise generation models, trained with a generative adversarial training algorithm, randomly generate noise spectra. The speech synthesis models are then trained so that the sum of their output and the randomly generated noise approximates the spectra of the noisy speech. Because the noise generation model parameters fit the spectra of the observed noise, the proposed method alleviates the spectral distortion caused by the conventional method. Experimental results demonstrate that the proposed method outperforms the conventional method in terms of synthetic speech quality.
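To make the training criterion concrete, below is a minimal PyTorch-style sketch of the idea described above: a noise generation model maps random latent vectors to noise spectra, the synthesis model's output plus that generated noise is compared against observed noisy spectra, and a discriminator supplies the adversarial criterion. All module names (synthesis_model, noise_generator, discriminator), layer sizes, the plain additive combination, and the auxiliary MSE reconstruction term are illustrative assumptions, not the paper's exact formulation; in practice the addition would be carried out in a domain where noise is actually additive (e.g., amplitude or power spectra).

```python
import torch
import torch.nn as nn

# Hypothetical dimensions: linguistic-feature, spectrum, and latent sizes.
FEAT_DIM, SPEC_DIM, NOISE_DIM = 128, 513, 64

# Acoustic model: linguistic features -> speech spectrum.
synthesis_model = nn.Sequential(
    nn.Linear(FEAT_DIM, 256), nn.LeakyReLU(0.2), nn.Linear(256, SPEC_DIM))

# Noise generation model: random latent vector -> noise spectrum.
noise_generator = nn.Sequential(
    nn.Linear(NOISE_DIM, 256), nn.LeakyReLU(0.2), nn.Linear(256, SPEC_DIM))

# Discriminator for the adversarial criterion: real vs. generated noisy spectra.
discriminator = nn.Sequential(
    nn.Linear(SPEC_DIM, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1), nn.Sigmoid())

bce = nn.BCELoss()
opt_g = torch.optim.Adam(
    list(synthesis_model.parameters()) + list(noise_generator.parameters()), lr=1e-4)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=1e-4)


def train_step(linguistic_feats, noisy_spectra):
    """One adversarial update; both arguments are (batch, dim) tensors."""
    batch = noisy_spectra.size(0)
    z = torch.randn(batch, NOISE_DIM)

    # Generated noisy spectrum = synthesized speech + randomly generated noise.
    fake = synthesis_model(linguistic_feats) + noise_generator(z)

    # Discriminator update: distinguish observed noisy spectra from generated ones.
    opt_d.zero_grad()
    d_loss = (bce(discriminator(noisy_spectra), torch.ones(batch, 1))
              + bce(discriminator(fake.detach()), torch.zeros(batch, 1)))
    d_loss.backward()
    opt_d.step()

    # Generator update: reconstruct the noisy spectra and fool the discriminator.
    opt_g.zero_grad()
    g_loss = (nn.functional.mse_loss(fake, noisy_spectra)
              + bce(discriminator(fake), torch.ones(batch, 1)))
    g_loss.backward()
    opt_g.step()
    return d_loss.item(), g_loss.item()


# Toy usage with random stand-in data.
feats = torch.randn(8, FEAT_DIM)
noisy = torch.randn(8, SPEC_DIM)
print(train_step(feats, noisy))
```

Note that only the sum of the two generator outputs is scored against the noisy observations, which is what lets the synthesis model and the noise generation model factor the noisy spectrum into speech and noise components without clean references.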
