Statistical parametric speech synthesis using deep neural networks

Conventional approaches to statistical parametric speech synthesis typically use decision tree-clustered context-dependent hidden Markov models (HMMs) to represent probability densities of speech parameters given texts. Speech parameters are generated from the probability densities so as to maximize their output probabilities, and a speech waveform is then reconstructed from the generated parameters. This approach is reasonably effective but has limitations, e.g., decision trees are inefficient at modeling complex context dependencies. This paper examines an alternative scheme based on a deep neural network (DNN). The relationship between input texts and their acoustic realizations is modeled by a DNN, which can address some of the limitations of the conventional approach. Experimental results show that the DNN-based systems outperformed the HMM-based systems with similar numbers of parameters.
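
To make the described architecture concrete, the sketch below shows a feed-forward network that maps frame-level linguistic features to acoustic features, which a vocoder would then turn into a waveform. This is a minimal illustration only, not the authors' implementation: the layer sizes, activation choice, and input/output dimensions are assumptions, written in PyTorch-style Python.

```python
# Minimal sketch of a DNN acoustic model for statistical parametric speech
# synthesis: frame-level linguistic features in, acoustic features out.
# All dimensions and layer sizes below are illustrative assumptions.
import torch
import torch.nn as nn

class DNNAcousticModel(nn.Module):
    def __init__(self, linguistic_dim=400, acoustic_dim=127,
                 hidden_dim=1024, num_hidden_layers=4):
        super().__init__()
        layers = []
        in_dim = linguistic_dim
        for _ in range(num_hidden_layers):
            layers += [nn.Linear(in_dim, hidden_dim), nn.Sigmoid()]
            in_dim = hidden_dim
        layers.append(nn.Linear(in_dim, acoustic_dim))  # linear output layer
        self.net = nn.Sequential(*layers)

    def forward(self, linguistic_features):
        # linguistic_features: (num_frames, linguistic_dim)
        return self.net(linguistic_features)

# Usage: predict acoustic features (e.g. spectral and excitation parameters)
# for a sequence of frames; a separate vocoder reconstructs the waveform.
model = DNNAcousticModel()
frames = torch.randn(10, 400)   # 10 frames of (assumed) linguistic features
acoustic = model(frames)        # shape: (10, 127)
```

In a typical setup of this kind the network would be trained to minimize the error between predicted and extracted acoustic features over all frames of the training corpus; the exact training criterion and feature sets used in the paper are not reproduced here.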
