Directly modeling voiced and unvoiced components in speech waveforms by neural networks

This paper proposes a novel acoustic model based on neural networks for statistical parametric speech synthesis. The neural network outputs the parameters of a non-zero-mean Gaussian process that defines a probability density function of a speech waveform given linguistic features. The mean and covariance functions of the Gaussian process represent the deterministic (voiced) and stochastic (unvoiced) components of the waveform, respectively, whereas the previous approach modeled the unvoiced component only. Experimental results show that the proposed approach can generate speech waveforms that approximate natural speech.
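
The core idea can be pictured as a network that maps linguistic features to the parameters of a Gaussian density over waveform samples: the predicted mean carries the deterministic (voiced) component, and the predicted variance carries the stochastic (unvoiced) component. Below is a minimal PyTorch sketch of that idea, not the paper's actual architecture: the frame length, feature dimension, layer sizes, and all names are placeholders, and the covariance is simplified to a diagonal, whereas the paper defines full mean and covariance functions of a Gaussian process.

```python
# Minimal sketch only: a feed-forward network predicts per-frame mean and
# log-variance of a Gaussian over waveform samples. Diagonal covariance is
# an assumption for brevity; the paper's covariance is not diagonal.
import torch
import torch.nn as nn

FRAME_LEN = 80   # waveform samples per frame (assumed)
FEAT_DIM = 300   # linguistic feature dimension (assumed)

class WaveformGaussianNet(nn.Module):
    def __init__(self, feat_dim=FEAT_DIM, frame_len=FRAME_LEN, hidden=512):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.mean_head = nn.Linear(hidden, frame_len)    # voiced (deterministic) part
        self.logvar_head = nn.Linear(hidden, frame_len)  # unvoiced (stochastic) part

    def forward(self, feats):
        h = self.body(feats)
        return self.mean_head(h), self.logvar_head(h)

def nll(mean, logvar, wave):
    # Negative log-likelihood of the waveform under N(mean, diag(exp(logvar))),
    # up to an additive constant.
    return 0.5 * (logvar + (wave - mean) ** 2 / logvar.exp()).sum(dim=-1).mean()

# Toy usage: one batch of 4 frames of random features and waveform targets.
net = WaveformGaussianNet()
feats = torch.randn(4, FEAT_DIM)
wave = torch.randn(4, FRAME_LEN)
mean, logvar = net(feats)
loss = nll(mean, logvar, wave)
loss.backward()

# Synthesis: deterministic mean plus a sampled stochastic component.
with torch.no_grad():
    sample = mean + logvar.mul(0.5).exp() * torch.randn_like(mean)
```

Training by maximum likelihood then amounts to minimizing this negative log-likelihood, so the network learns to route periodic structure into the mean and residual noise into the variance.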
