Paper information - A deep auto-encoder based low-dimensional feature extraction from FFT spectral envelopes for statistical parametric speech synthesis

In state-of-the-art statistical parametric speech synthesis systems, a speech analysis module, e.g. STRAIGHT spectral analysis, is generally used to obtain accurate and stable spectral envelopes, and low-dimensional acoustic features extracted from these envelopes are then used to train acoustic models. However, the spectral envelope estimation algorithm in such a module includes various processing steps derived from human knowledge. In this paper, we present our investigation of deep auto-encoder based, non-linear, data-driven and unsupervised low-dimensional feature extraction from FFT spectral envelopes for statistical parametric speech synthesis. Experimental results showed that a text-to-speech synthesis system using deep auto-encoder based low-dimensional feature extraction from FFT spectral envelopes is a promising approach.
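The idea in the abstract, compressing FFT-derived spectra into a few learned features with an auto-encoder, can be illustrated with a small sketch. It is not the authors' system: the frames are synthetic noisy sinusoids rather than speech, the network has a single tanh bottleneck layer rather than a deep stack, and every name and setting (`log_spectrum`, `make_dataset`, `AutoEncoder`, the frame length of 16, the 3-dimensional code, the learning rate) is a hypothetical choice made for illustration.

```python
import cmath
import math
import random


def log_spectrum(frame):
    """Log-magnitude DFT of one Hann-windowed frame (bins 0..N/2)."""
    n = len(frame)
    windowed = [x * (0.5 - 0.5 * math.cos(2 * math.pi * t / n))
                for t, x in enumerate(frame)]
    spectrum = []
    for k in range(n // 2 + 1):
        s = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                for t, x in enumerate(windowed))
        spectrum.append(math.log(abs(s) + 1e-3))
    return spectrum


def make_dataset(num_frames=40, frame_len=16, seed=0):
    """Toy stand-in for speech: one noisy sinusoid per frame, mean-centred."""
    rng = random.Random(seed)
    data = []
    for _ in range(num_frames):
        freq = rng.uniform(1.0, frame_len / 2 - 1.0)  # in DFT-bin units
        phase = rng.uniform(0.0, 2 * math.pi)
        frame = [math.sin(2 * math.pi * freq * t / frame_len + phase)
                 + rng.gauss(0.0, 0.05) for t in range(frame_len)]
        data.append(log_spectrum(frame))
    dim = len(data[0])
    means = [sum(row[d] for row in data) / len(data) for d in range(dim)]
    return [[row[d] - means[d] for d in range(dim)] for row in data]


class AutoEncoder:
    """tanh encoder to a bottleneck, linear decoder back to the spectrum."""

    def __init__(self, dim, code_dim, seed=0):
        rng = random.Random(seed)
        self.w_enc = [[rng.uniform(-0.3, 0.3) for _ in range(dim)]
                      for _ in range(code_dim)]
        self.b_enc = [0.0] * code_dim
        self.w_dec = [[rng.uniform(-0.3, 0.3) for _ in range(code_dim)]
                      for _ in range(dim)]
        self.b_dec = [0.0] * dim

    def encode(self, x):
        return [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
                for row, b in zip(self.w_enc, self.b_enc)]

    def decode(self, h):
        return [sum(w * hi for w, hi in zip(row, h)) + b
                for row, b in zip(self.w_dec, self.b_dec)]

    def loss(self, data):
        """Mean squared reconstruction error per frame."""
        total = 0.0
        for x in data:
            y = self.decode(self.encode(x))
            total += sum((yi - xi) ** 2 for yi, xi in zip(y, x))
        return total / len(data)

    def train(self, data, epochs=200, lr=0.005):
        """Plain per-frame gradient descent on the squared error."""
        for _ in range(epochs):
            for x in data:
                h = self.encode(x)
                y = self.decode(h)
                err = [yi - xi for yi, xi in zip(y, x)]
                # Back-propagate through the decoder, then through tanh,
                # before any weights are modified.
                grad_h = [sum(err[d] * self.w_dec[d][j] for d in range(len(x)))
                          * (1.0 - h[j] ** 2) for j in range(len(h))]
                for d, e in enumerate(err):
                    self.b_dec[d] -= lr * e
                    for j, hj in enumerate(h):
                        self.w_dec[d][j] -= lr * e * hj
                for j, g in enumerate(grad_h):
                    self.b_enc[j] -= lr * g
                    for d, xd in enumerate(x):
                        self.w_enc[j][d] -= lr * g * xd


data = make_dataset()
model = AutoEncoder(dim=len(data[0]), code_dim=3)
print("reconstruction error before training:", model.loss(data))
model.train(data)
print("reconstruction error after training: ", model.loss(data))
print("low-dimensional feature for frame 0: ", model.encode(data[0]))
```

After training, `encode` maps each 9-bin log spectrum to a 3-dimensional feature vector. In a synthesis pipeline such as the one the abstract describes, an acoustic model would be trained on features of this kind, and `decode` would map predicted features back to a spectrum.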

Junichi Yamagishi | Shinji Takaki
