Directly modeling speech waveforms by neural networks for statistical parametric speech synthesis

This paper proposes a novel approach for directly modeling speech at the waveform level using a neural network. The approach builds on the neural-network-based statistical parametric speech synthesis framework with a specially designed output layer. Because acoustic feature extraction is integrated into acoustic model training, the approach can overcome the limitations of conventional approaches, such as two-step optimization (feature extraction followed by acoustic modeling), the use of spectra rather than waveforms as training targets, the use of overlapping and shifted frames as the basic unit, and a fixed decision tree structure. Experimental results show that the proposed approach can directly maximize a likelihood defined in the waveform domain.
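To make the idea of a waveform-domain likelihood concrete, below is a minimal sketch under a simplifying assumption: the acoustic model's output layer is taken to predict the coefficients of an all-pole (linear predictive) filter, and each waveform sample is modeled as Gaussian given the preceding samples. The function `waveform_nll` and the AR(2) toy signal are hypothetical illustrations, not the paper's exact cepstral formulation; minimizing this negative log-likelihood with respect to the predicted filter parameters corresponds to maximizing the likelihood directly at the waveform level.

```python
import numpy as np

def waveform_nll(x, a, sigma2):
    """Negative log-likelihood of waveform segment x under a Gaussian
    autoregressive (all-pole) model.

    The prediction residual e[t] = x[t] - sum_k a[k] * x[t-1-k] is
    assumed to be zero-mean Gaussian white noise with variance sigma2.
    In the paper's framework, the filter parameters would come from the
    neural network's specially designed output layer; here they are
    passed in directly for illustration.
    """
    p = len(a)
    nll = 0.0
    for t in range(p, len(x)):
        past = x[t - p:t][::-1]               # x[t-1], x[t-2], ..., x[t-p]
        e = x[t] - np.dot(a, past)            # prediction residual
        nll += 0.5 * (np.log(2.0 * np.pi * sigma2) + e * e / sigma2)
    return nll

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthesize a toy AR(2) waveform (stable: poles inside the unit circle).
    a_true = np.array([1.5, -0.8])
    sigma2_true = 0.1
    x = np.zeros(1000)
    for t in range(2, len(x)):
        x[t] = a_true @ x[t - 2:t][::-1] + rng.normal(0.0, np.sqrt(sigma2_true))
    # The true filter parameters yield a lower NLL than a mismatched model,
    # which is the quantity a waveform-level training criterion would minimize.
    print(waveform_nll(x, a_true, sigma2_true))
    print(waveform_nll(x, np.zeros(2), 1.0))
```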
