Deep neural network based trainable voice source model for synthesis of speech with varying vocal effort

This paper studies a deep neural network (DNN) based voice source modelling method for the synthesis of speech with varying vocal effort. The new trainable voice source model uses a DNN to learn a mapping from acoustic features to the time-domain, pitch-synchronous glottal flow waveform. The voice source model is trained on speech material from breathy, normal, and Lombard speech. In synthesis, a normal voice is first adapted to a desired style, and the flexible DNN-based voice source model then automatically generates a style-specific excitation waveform from the adapted acoustic features. The proposed voice source model is compared to a robust, high-quality baseline excitation method that uses a manually selected mean glottal flow pulse for each vocal effort level together with a spectral matching filter that shapes the voice source spectrum to the desired style. Subjective evaluations show that the proposed DNN-based method is rated as comparable to the baseline, while avoiding manual pulse selection and being computationally faster than a system that uses a spectral matching filter.
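
As a rough illustration of the mapping described above, the sketch below shows a small feedforward network that regresses a per-frame acoustic feature vector onto a fixed-length, pitch-synchronous glottal flow pulse. This is a minimal, hypothetical sketch in PyTorch: the feature dimension, layer sizes, pulse length, and MSE objective are assumptions for illustration, not the paper's exact configuration.

```python
# Minimal sketch (assumed dimensions, not the paper's exact setup):
# a feedforward DNN that maps per-frame acoustic features to a
# fixed-length, pitch-synchronous glottal flow pulse.
import torch
import torch.nn as nn

ACOUSTIC_DIM = 48   # hypothetical: e.g., F0, energy, spectral features
PULSE_LEN = 400     # hypothetical: glottal pulse resampled to a fixed length

class GlottalPulseDNN(nn.Module):
    def __init__(self, in_dim=ACOUSTIC_DIM, hidden=512, out_dim=PULSE_LEN):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, out_dim),  # time-domain pulse samples
        )

    def forward(self, feats):
        return self.net(feats)

# Training on (acoustic features, target pulse) pairs pooled from
# breathy, normal, and Lombard speech; MSE against a glottal pulse
# obtained by inverse filtering is one plausible objective.
model = GlottalPulseDNN()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

feats = torch.randn(32, ACOUSTIC_DIM)   # stand-in batch of acoustic features
target = torch.randn(32, PULSE_LEN)     # stand-in reference glottal pulses

pred = model(feats)
loss = loss_fn(pred, target)
optimizer.zero_grad()
loss.backward()
optimizer.step()

# At synthesis time, style-adapted acoustic features would be pushed
# through the trained network to generate style-specific pulses, which
# are then scaled to the target pitch period and concatenated
# pitch-synchronously to form the excitation signal.
```

Because the network is trained on pooled material from all three styles, a single model can produce style-specific pulses directly from the adapted acoustic features, which is what removes the need for manual per-style pulse selection in the baseline.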