On the training aspects of Deep Neural Network (DNN) for parametric TTS synthesis

Deep Neural Network (DNN), which can model a long-span, intricate transform compactly with a deep-layered structure, has recently been investigated for parametric TTS synthesis with a fairly large corpus (33,000 utterances) [6]. In this paper, we examine DNN TTS synthesis with a moderate size corpus of 5 hours, which is more commonly used for parametric TTS training. DNN is used to map input text features into output acoustic features (LSP, F0 and V/U). Experimental results show that DNN can outperform the conventional HMM, which is trained in ML first and then refined by MGE. Both objective and subjective measures indicate that DNN can synthesize speech better than HMM-based baseline. The improvement is mainly on the prosody, i.e., the RMSE of natural and generated F0 trajectories by DNN is improved by 2 Hz. This benefit is likely from the key characteristics of DNN, which can exploit feature correlations, e.g., between F0 and spectrum, without using a more restricted, e.g. diagonal Gaussian probability family. Our experimental results also show: the layer-wise BP pre-training can drive weights to a better starting point than random initialization and result in a more effective DNN; state boundary info is important for training DNN to yield better synthesized speech; and a hyperbolic tangent activation function in DNN hidden layers yields faster convergence than a sigmoidal one.

[1]  Geoffrey E. Hinton,et al.  Learning representations by back-propagating errors , 1986, Nature.

[2]  Anders Krogh,et al.  A Simple Weight Decay Can Improve Generalization , 1991, NIPS.

[3]  Michael Picheny,et al.  New methods in continuous Mandarin speech recognition , 1997, EUROSPEECH.

[4]  Keiichi Tokuda,et al.  Speech parameter generation algorithms for HMM-based speech synthesis , 2000, 2000 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.00CH37100).

[5]  Koichi Shinoda,et al.  MDL-based context-dependent subword modeling for speech recognition , 2000 .

[6]  Ren-Hua Wang,et al.  Minimum Generation Error Training for HMM-Based Speech Synthesis , 2006, 2006 IEEE International Conference on Acoustics Speech and Signal Processing Proceedings.

[7]  Yee Whye Teh,et al.  A Fast Learning Algorithm for Deep Belief Nets , 2006, Neural Computation.

[8]  Ren-Hua Wang,et al.  USTC System for Blizzard Challenge 2006 an Improved HMM-based Speech Synthesis Method , 2006 .

[9]  Heiga Zen,et al.  Statistical Parametric Speech Synthesis , 2007, 2007 IEEE International Conference on Acoustics, Speech and Signal Processing - ICASSP '07.

[10]  Yoshua Bengio,et al.  Why Does Unsupervised Pre-training Help Deep Learning? , 2010, AISTATS.

[11]  Dong Yu,et al.  Roles of Pre-Training and Fine-Tuning in Context-Dependent DBN-HMMs for Real-World Speech Recognition , 2010 .

[12]  Dong Yu,et al.  Conversational Speech Transcription Using Context-Dependent Deep Neural Networks , 2012, ICML.

[13]  Dong Yu,et al.  Feature engineering in Context-Dependent Deep Neural Networks for conversational speech transcription , 2011, 2011 IEEE Workshop on Automatic Speech Recognition & Understanding.

[14]  Tara N. Sainath,et al.  Making Deep Belief Networks effective for large vocabulary continuous speech recognition , 2011, 2011 IEEE Workshop on Automatic Speech Recognition & Understanding.

[15]  Tara N. Sainath,et al.  FUNDAMENTAL TECHNOLOGIES IN MODERN SPEECH RECOGNITION Digital Object Identifier 10.1109/MSP.2012.2205597 , 2012 .

[16]  Dong Yu,et al.  Context-Dependent Pre-Trained Deep Neural Networks for Large-Vocabulary Speech Recognition , 2012, IEEE Transactions on Audio, Speech, and Language Processing.

[17]  S. King,et al.  Combining a vector space representation of linguistic context with a deep neural network for text-to-speech synthesis , 2013, SSW.

[18]  Heiga Zen,et al.  Statistical parametric speech synthesis using deep neural networks , 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[19]  Helen M. Meng,et al.  Multi-distribution deep belief network for speech synthesis , 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[20]  Dong Yu,et al.  Modeling spectral envelopes using restricted Boltzmann machines for statistical parametric speech synthesis , 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.