Model-Based Parametric Prosody Synthesis with Deep Neural Network

Conventional statistical parametric speech synthesis (SPSS) models frame-wise acoustic observations with probability densities at the HMM state level, combining the resulting statistical acoustic models with decision trees. It is therefore a purely data-driven statistical approach with no explicit integration of the articulatory mechanisms identified in speech production research. The present study explores an alternative paradigm, model-based parametric prosody synthesis (MPPS), which integrates the dynamic mechanisms of human speech production as a core component of F0 generation. In this paradigm, contextual variation in prosody is handled in two separate yet integrated stages: linguistic to motor, and motor to acoustic. The motor model is target approximation (TA), which generates syllable-sized F0 contours from only three motor parameters that are associated with linguistic functions. We simulate this two-stage process by linking the TA model to a deep neural network (DNN): the DNN learns the "linguistic-motor" mapping, while the "motor-acoustic" mapping is provided by TA-based syllable-wise F0 production. The proposed prosody modeling system outperforms an HMM-based baseline in both objective and subjective evaluations.
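To make the motor-to-acoustic stage concrete, here is a minimal sketch of syllable-wise F0 generation under the quantitative target approximation (qTA) formulation of the TA model: each syllable's F0 is the response of a third-order critically damped system approaching a linear pitch target x(t) = m·t + b at rate λ, with the final F0 state of one syllable transferred as the initial state of the next. The function name, sampling rate, and example target values are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def qta_f0(duration, m, b, lam, init_state, fs=200):
    """One syllable of F0 (e.g. in semitones) under quantitative target approximation.

    The contour is the response of a third-order critically damped system
    approaching the linear pitch target x(t) = m*t + b at rate lam (lambda).
    init_state = (f0, f0', f0'') at syllable onset, carried over from the
    end of the previous syllable so that F0 is smooth across boundaries.
    """
    p1, p2, p3 = init_state
    # Transient coefficients chosen so that f0 and its first two derivatives
    # match init_state at t = 0.
    c1 = p1 - b
    c2 = p2 + c1 * lam - m
    c3 = (p3 + 2.0 * c2 * lam - c1 * lam ** 2) / 2.0

    t = np.arange(0.0, duration, 1.0 / fs)
    poly = c1 + c2 * t + c3 * t ** 2
    f0 = (m * t + b) + poly * np.exp(-lam * t)

    # Final state at t = duration, passed on to the next syllable.
    T, e = duration, np.exp(-lam * duration)
    P, dP = c1 + c2 * T + c3 * T ** 2, c2 + 2.0 * c3 * T
    final_state = (m * T + b + P * e,
                   m + (dP - lam * P) * e,
                   (2.0 * c3 - 2.0 * lam * dP + lam ** 2 * P) * e)
    return f0, final_state


# Illustrative two-syllable sequence (all target values are hypothetical):
state = (88.0, 0.0, 0.0)                      # neutral onset, F0 in semitones re 1 Hz
contours = []
for dur, (m, b, lam) in [(0.20, (0.0, 92.0, 40.0)),     # high level target
                         (0.25, (-30.0, 90.0, 50.0))]:  # falling target
    f0, state = qta_f0(dur, m, b, lam, state)
    contours.append(f0)
f0_track = np.concatenate(contours)
```

Carrying the final (f0, f0', f0'') state across syllable boundaries is what lets purely syllable-local targets still produce a smooth sentence-level contour.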

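The linguistic-to-motor stage can then be approximated, purely as an illustration, by a regression DNN that maps syllable-level context features to the three TA parameters (m, b, λ), which in turn drive the qTA generator above. The feature set, data shapes, and network configuration below are assumptions for the sketch and are not taken from the paper.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Toy stand-in data: in practice X would hold syllable-level linguistic context
# features (tone identity, position in word/phrase, stress, ...) and Y the three
# TA parameters (m, b, lambda) fitted to each natural syllable's F0 contour.
rng = np.random.default_rng(0)
X = rng.standard_normal((2000, 30))          # (n_syllables, n_context_features)
Y = rng.standard_normal((2000, 3))           # (n_syllables, [m, b, lambda])

# Hypothetical network size; a real system would also constrain lambda to be
# positive, e.g. by predicting log(lambda).
dnn = MLPRegressor(hidden_layer_sizes=(256, 256, 256),
                   activation="relu", max_iter=300, random_state=0)
dnn.fit(X, Y)

# At synthesis time: predict the motor parameters for each syllable of the
# input text, then run qta_f0() syllable by syllable, carrying the final
# F0 state of each syllable over as the initial state of the next.
m, b, lam = dnn.predict(X[:1])[0]
```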