Paper information - Statistical parametric speech synthesis: from HMM to LSTM-RNN

Statistical parametric speech synthesis (SPSS) combines an acoustic model and a vocoder to render speech from a given text. Typically, decision tree-clustered context-dependent hidden Markov models (HMMs) are employed as the acoustic model, which represents the relationship between linguistic and acoustic features. Recently, artificial neural network-based acoustic models, such as deep neural networks, mixture density networks, and long short-term memory recurrent neural networks (LSTM-RNNs), have shown significant improvements over the HMM-based approach. This paper reviews the progress of acoustic modeling in SPSS from the HMM to the LSTM-RNN.

Heiga Zen

[1] S. King,et al. Combining a vector space representation of linguistic context with a deep neural network for text-to-speech synthesis , 2013, SSW.

[2] Zhizheng Wu,et al. Sentence-level control vectors for deep neural network speech synthesis , 2015, INTERSPEECH.

[3] Heiga Zen,et al. Statistical parametric speech synthesis using deep neural networks , 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[4] Heiga Zen,et al. Reformulating the HMM as a trajectory model by imposing explicit relationships between static and dynamic feature vector sequences , 2007, Comput. Speech Lang..

[5] Kuldip K. Paliwal,et al. Bidirectional recurrent neural networks , 1997, IEEE Trans. Signal Process..

[6] Heiga Zen,et al. A Viterbi algorithm for a trajectory model derived from HMM with explicit relationship between static and dynamic features , 2004, 2004 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[7] Heiga Zen,et al. Decision tree-based context clustering based on cross validation and hierarchical priors , 2011, 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[8] H. Zen,et al. An HMM-based speech synthesis system applied to English , 2002, Proceedings of 2002 IEEE Workshop on Speech Synthesis.

[9] Bhuvana Ramabhadran,et al. F0 contour prediction with a deep belief network-Gaussian process hybrid model , 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[10] Orhan Karaali,et al. Speech Synthesis with Neural Networks , 1998, ArXiv.

[11] Heiga Zen,et al. Deep learning in speech synthesis , 2013, SSW.

[12] Noel Massey,et al. Text-to-speech conversion with neural networks: a recurrent TDNN approach , 1998, EUROSPEECH.

[13] Anthony J. Robinson,et al. Static and Dynamic Error Propagation Networks with Application to Speech Coding , 1987, NIPS.

[14] Zhizheng Wu,et al. Minimum trajectory error training for deep neural networks, combined with stacked bottleneck features , 2015, INTERSPEECH.

[15] Mike Schuster,et al. On supervised learning from sequential data with applications for speech recognition , 1999 .

[16] Heiga Zen,et al. Product of Experts for Statistical Parametric Speech Synthesis , 2012, IEEE Transactions on Audio, Speech, and Language Processing.

[17] Mark J. F. Gales. Cluster adaptive training of hidden Markov models , 2000, IEEE Trans. Speech Audio Process..

[18] Ranniery Maia,et al. Towards a linear dynamical model based speech synthesizer , 2015, INTERSPEECH.

[19] Jürgen Schmidhuber,et al. Long Short-Term Memory , 1997, Neural Computation.

[20] P. J. Werbos. Backpropagation through time: what it does and how to do it , 1990 .

[21] Frank K. Soong,et al. On the training aspects of Deep Neural Network (DNN) for parametric TTS synthesis , 2014, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[22] Heiga Zen,et al. Estimating Trajectory Hmm Parameters Using Monte Carlo Em With Gibbs Sampler , 2006, 2006 IEEE International Conference on Acoustics Speech and Signal Processing Proceedings.

[23] Junichi Yamagishi,et al. Average-Voice-Based Speech Synthesis , 2006 .

[24] Heiga Zen,et al. A Hidden Semi-Markov Model-Based Speech Synthesis System , 2007, IEICE Trans. Inf. Syst..

[25] Marcus Liwicki,et al. A novel approach to on-line handwriting recognition based on bidirectional long short-term memory networks , 2007 .

[26] Yoshua Bengio,et al. Gradient Flow in Recurrent Nets: the Difficulty of Learning Long-Term Dependencies , 2001 .

[27] Hideki Kawahara,et al. Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based F0 extraction: Possible role of a repetitive structure in sounds , 1999, Speech Commun..

[28] C. M. Bishop. Mixture Density Networks , 1994 .

[29] Keiichi Tokuda,et al. An algorithm for speech parameter generation from continuous mixture HMMs with dynamic features , 1995, EUROSPEECH.

[30] Takashi Nose,et al. Statistical Parametric Speech Synthesis Based on Gaussian Process Regression , 2014, IEEE Journal of Selected Topics in Signal Processing.

[31] K. Koishida,et al. Vector quantization of speech spectral parameters using statistics of dynamic features , 1997 .

[32] Bhuvana Ramabhadran,et al. Prosody contour prediction with long short-term memory, bi-directional, deep recurrent neural networks , 2014, INTERSPEECH.

[33] Keiichi Tokuda,et al. Duration modeling for HMM-based speech synthesis , 1998, ICSLP.

[34] Keiichi Tokuda,et al. A Speech Parameter Generation Algorithm Considering Global Variance for HMM-Based Speech Synthesis , 2007, IEICE Trans. Inf. Syst..

[35] Keiichi Tokuda,et al. Speaker interpolation in HMM-based speech synthesis system , 1997, EUROSPEECH.

[36] Keiichi Tokuda,et al. Multi-Space Probability Distribution HMM , 2002 .

[37] Ranniery Maia,et al. Linear dynamical models in speech synthesis , 2014, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[38] Tomoki Toda,et al. Voice Conversion Based on Maximum-Likelihood Estimation of Spectral Parameter Trajectory , 2007, IEEE Transactions on Audio, Speech, and Language Processing.

[39] Heiga Zen,et al. Deep mixture density networks for acoustic modeling in statistical parametric speech synthesis , 2014, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[40] Yannis Agiomyrgiannakis,et al. Vocaine the vocoder and applications in speech synthesis , 2015, 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[41] Heiga Zen,et al. The Effect of Using Normalized Models in Statistical Speech Synthesis , 2011, INTERSPEECH.

[42] 全炳河,et al. Reformulating HMM as a trajectory model by imposing explicit relationships between static and dynamic features , 2006 .

[43] Simon King. A reading list of recent advances in speech synthesis , 2015 .

[44] Cassia Valentini-Botinhao,et al. Modelling acoustic feature dependencies with artificial neural networks: Trajectory-RNADE , 2015, 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[45] F. Itakura,et al. A statistical method for estimation of speech spectral density and formant frequencies , 1970 .

[46] Keiichi Tokuda,et al. Simultaneous modeling of spectrum, pitch and duration in HMM-based speech synthesis , 1999, EUROSPEECH.

[47] Nam Soo Kim,et al. Decision Tree-Based Clustering with Outlier Detection for HMM-Based Speech Synthesis , 2011, INTERSPEECH.

[48] Yoshihiko Nankaku,et al. The effect of neural networks in statistical parametric speech synthesis , 2015, 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[49] Li-Rong Dai,et al. Statistical parametric speech synthesis using a hidden trajectory model , 2015, Speech Commun..

[50] Paavo Alku,et al. Voice source modelling using deep neural networks for statistical parametric speech synthesis , 2014, 2014 22nd European Signal Processing Conference (EUSIPCO).

[51] Yoram Singer,et al. Adaptive Subgradient Methods for Online Learning and Stochastic Optimization , 2011, J. Mach. Learn. Res..

[52] Heiga Zen,et al. Context adaptive training with factorized decision trees for HMM-based statistical parametric speech synthesis , 2011, Speech Commun..

[53] Zhi-Jie Yan,et al. Cross-validation based decision tree clustering for HMM-based TTS , 2010, 2010 IEEE International Conference on Acoustics, Speech and Signal Processing.

[54] Jeff A. Bilmes,et al. Robust splicing costs and efficient search with BMM Models for concatenative speech synthesis , 2002, 2002 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[55] Sridha Sridharan,et al. Trainable speech synthesis with trended hidden Markov models , 2001, 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[56] Dong Yu,et al. Modeling Spectral Envelopes Using Restricted Boltzmann Machines and Deep Belief Networks for Statistical Parametric Speech Synthesis , 2013, IEEE Transactions on Audio, Speech, and Language Processing.

[57] Takashi Nose,et al. A Style Control Technique for HMM-Based Expressive Speech Synthesis , 2007, IEICE Trans. Inf. Syst..

[58] Alan W. Black,et al. Unit selection in a concatenative speech synthesis system using a large speech database , 1996, 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing Conference Proceedings.

[59] Frank K. Soong,et al. Generating natural F0 trajectory with additive trees , 2008, INTERSPEECH.

[60] Simon King,et al. Deep neural networks employing Multi-Task Learning and stacked bottleneck features for speech synthesis , 2015, 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[61] Helen M. Meng,et al. Multi-distribution deep belief network for speech synthesis , 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[62] Lawrence R. Rabiner,et al. A tutorial on hidden Markov models and selected applications in speech recognition , 1989, Proc. IEEE.

[63] Ren-Hua Wang,et al. Minimum Generation Error Training for HMM-Based Speech Synthesis , 2006, 2006 IEEE International Conference on Acoustics Speech and Signal Processing Proceedings.

[64] Jong-Jin Kim,et al. HMM-based Korean speech synthesis system for hand-held devices , 2006, IEEE Transactions on Consumer Electronics.

[65] Keiichi Tokuda,et al. Eigenvoices for HMM-based speech synthesis , 2002, INTERSPEECH.

[66] Heiga Zen,et al. An excitation model for HMM-based speech synthesis based on residual modeling , 2007, SSW.

[67] Geoffrey E. Hinton,et al. Distributed Representations , 1986, The Philosophy of Artificial Intelligence.

[68] Heiga Zen,et al. Autoregressive Models for Statistical Parametric Speech Synthesis , 2013, IEEE Transactions on Audio, Speech, and Language Processing.

[69] Harri Valpola,et al. Bayesian Ensemble Learning for Nonlinear Factor Analysis , 2000 .

[70] Zhizheng Wu,et al. Fusion of multiple parameterisations for DNN-based sinusoidal speech synthesis with multi-task learning , 2015, INTERSPEECH.

[71] Geoffrey E. Hinton,et al. On rectified linear units for speech processing , 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[72] Nir Friedman,et al. Probabilistic Graphical Models , 2009, Data-Driven Computational Neuroscience.

[73] Yoshihiko Nankaku,et al. Contextual Additive Structure for HMM-Based Speech Synthesis , 2014, IEEE Journal of Selected Topics in Signal Processing.

[74] Frank K. Soong,et al. Multi-speaker modeling and speaker adaptation for DNN-based TTS synthesis , 2015, 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[75] Heiga Zen,et al. Unidirectional long short-term memory recurrent neural network with recurrent output layer for low-latency speech synthesis , 2015, 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[76] Heiga Zen,et al. Context-dependent additive log f_0 model for HMM-based speech synthesis , 2009, INTERSPEECH.

[77] Dong Yu,et al. Deep Learning and Its Applications to Signal and Information Processing , 2011 .

[78] Frank K. Soong,et al. TTS synthesis with bidirectional LSTM based recurrent neural networks , 2014, INTERSPEECH.

[79] Sadaoki Furui,et al. Speaker-independent isolated word recognition using dynamic features of speech spectrum , 1986, IEEE Trans. Acoust. Speech Signal Process..

[80] Georg Heigold,et al. An empirical study of learning rates in deep neural networks for speech recognition , 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[81] Keiichi Tokuda,et al. Speech parameter generation algorithms for HMM-based speech synthesis , 2000, 2000 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[82] Haifeng Li,et al. Sequence error (SE) minimization training of neural network for voice conversion , 2014, INTERSPEECH.

[83] J. J. Odell,et al. The Use of Context in Large Vocabulary Speech Recognition , 1995 .

[84] Alex Graves,et al. Generating Sequences With Recurrent Neural Networks , 2013, ArXiv.

[85] S. Roweis,et al. Learning Nonlinear Dynamical Systems Using the Expectation–Maximization Algorithm , 2001 .

[86] William J. Byrne,et al. Autoregressive clustering for HMM speech synthesis , 2010, INTERSPEECH.

[87] Satoshi Imai,et al. Cepstral analysis synthesis on the mel frequency scale , 1983, ICASSP.

[88] Eric Moulines,et al. Pitch-synchronous waveform processing techniques for text-to-speech synthesis using diphones , 1989, Speech Commun..

[89] Yoshua Bengio,et al. Learning Deep Architectures for AI , 2007, Found. Trends Mach. Learn..

[90] Keiichi Tokuda,et al. Adaptation of pitch and spectrum for HMM-based speech synthesis using MLLR , 2001, 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[91] Feng Ding,et al. A polynomial segment model based statistical parametric speech synthesis sytem , 2009, 2009 IEEE International Conference on Acoustics, Speech and Signal Processing.

[92] Mark J. F. Gales,et al. Switching linear dynamical systems for speech recognition , 2003 .

[93] Kai Yu,et al. An investigation of implementation and performance analysis of DNN based speech synthesis system , 2014, 2014 12th International Conference on Signal Processing (ICSP).

[94] Frank K. Soong,et al. Modeling DCT parameterized F0 trajectory at intonation phrase level with DNN or decision tree , 2014, INTERSPEECH.

[95] Heiga Zen,et al. Statistical Parametric Speech Synthesis Based on Speaker and Language Factorization , 2012, IEEE Transactions on Audio, Speech, and Language Processing.

[96] Tomoki Toda,et al. A postfilter to modify the modulation spectrum in HMM-based speech synthesis , 2014, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[97] Shaul Markovitch,et al. Anytime Learning of Decision Trees , 2007, J. Mach. Learn. Res..

[98] Tony Robinson,et al. Speech synthesis using artificial neural networks trained on cepstral coefficients , 1993, EUROSPEECH.

[99] Carl Quillen. Kalman filter based speech synthesis , 2010, 2010 IEEE International Conference on Acoustics, Speech and Signal Processing.

[100] Heiga Zen,et al. Statistical Parametric Speech Synthesis , 2007, IEEE International Conference on Acoustics, Speech, and Signal Processing.