Statistical parametric speech synthesis using deep neural networks

Conventional approaches to statistical parametric speech synthesis typically use decision tree-clustered context-dependent hidden Markov models (HMMs) to represent probability densities of speech parameters given texts. Speech parameters are generated from the probability densities so as to maximize their output probabilities, and a speech waveform is then reconstructed from the generated parameters. This approach is reasonably effective but has limitations, e.g., decision trees are inefficient at modeling complex context dependencies. This paper examines an alternative scheme based on a deep neural network (DNN). The relationship between input texts and their acoustic realizations is modeled by a DNN, which can address some of the limitations of the conventional approach. Experimental results show that the DNN-based systems outperformed the HMM-based systems with similar numbers of parameters.
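
To make the described architecture concrete, the sketch below shows a feed-forward network that maps frame-level linguistic features to acoustic features, which a vocoder would then turn into a waveform. This is a minimal illustration only, not the authors' implementation: the layer sizes, activation choice, and input/output dimensions are assumptions, written in PyTorch-style Python.

```python
# Minimal sketch of a DNN acoustic model for statistical parametric speech
# synthesis: frame-level linguistic features in, acoustic features out.
# All dimensions and layer sizes below are illustrative assumptions.
import torch
import torch.nn as nn

class DNNAcousticModel(nn.Module):
    def __init__(self, linguistic_dim=400, acoustic_dim=127,
                 hidden_dim=1024, num_hidden_layers=4):
        super().__init__()
        layers = []
        in_dim = linguistic_dim
        for _ in range(num_hidden_layers):
            layers += [nn.Linear(in_dim, hidden_dim), nn.Sigmoid()]
            in_dim = hidden_dim
        layers.append(nn.Linear(in_dim, acoustic_dim))  # linear output layer
        self.net = nn.Sequential(*layers)

    def forward(self, linguistic_features):
        # linguistic_features: (num_frames, linguistic_dim)
        return self.net(linguistic_features)

# Usage: predict acoustic features (e.g. spectral and excitation parameters)
# for a sequence of frames; a separate vocoder reconstructs the waveform.
model = DNNAcousticModel()
frames = torch.randn(10, 400)   # 10 frames of (assumed) linguistic features
acoustic = model(frames)        # shape: (10, 127)
```

In a typical setup of this kind the network would be trained to minimize the error between predicted and extracted acoustic features over all frames of the training corpus; the exact training criterion and feature sets used in the paper are not reproduced here.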
