Personalized Spontaneous Speech Synthesis Using a Small-Sized Unsegmented Semispontaneous Speech

A systematic approach is proposed for synthesizing personalized spontaneous speech from a small, unsegmented speech corpus of the target speaker. First, an automatic segmentation algorithm segments and labels a semispontaneous speech corpus collected from the target speaker. Then, a pretrained average voice model is adapted to the target speaker using the segmented data. A postfilter based on the modulation spectrum is adopted to further improve the speaker similarity of the synthesized speech and to alleviate its over-smoothing problem. To generate spontaneous speech, a smoothing method applied at the prosodic word level is proposed to improve fluency. In an objective evaluation of spontaneous speech segmentation, the proposed method achieves higher segmentation accuracy than Viterbi-based forced alignment. Subjective listening tests also show that the proposed method improves the spontaneity and speaker similarity of the synthesized speech compared with speaker adaptation based on maximum likelihood linear regression (MLLR).
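The abstract does not spell out the postfilter, so the following is only a minimal sketch of one commonly described form of modulation-spectrum postfiltering, not necessarily the formulation used in this work. It assumes the modulation spectrum is the log power spectrum of each parameter trajectory over time, and that per-bin means and standard deviations for natural and generated speech have been estimated beforehand. The function names, the strength parameter `k`, and the statistics arguments are all hypothetical.

```python
import numpy as np

EPS = 1e-10  # assumed floor to avoid log(0)

def modulation_spectrum(traj, n_fft):
    """Log power spectrum and phase of each parameter trajectory.

    traj: array of shape (T, D), T frames of D speech parameters.
    Returns two arrays of shape (n_fft // 2 + 1, D).
    """
    spec = np.fft.rfft(traj, n=n_fft, axis=0)
    return np.log(np.abs(spec) ** 2 + EPS), np.angle(spec)

def ms_postfilter(traj, mu_nat, sd_nat, mu_gen, sd_gen, k=0.85, n_fft=None):
    """Shift the generated modulation spectrum toward natural statistics.

    mu_* / sd_*: assumed precomputed per-bin mean and standard deviation of
    the log modulation spectrum, shape (n_fft // 2 + 1, D).
    k: hypothetical strength in [0, 1]; k = 0 leaves the input unchanged.
    """
    n_frames = traj.shape[0]
    if n_fft is None:
        n_fft = n_frames
    log_ms, phase = modulation_spectrum(traj, n_fft)
    # Normalize with generated statistics, rescale with natural statistics.
    target = (sd_nat / sd_gen) * (log_ms - mu_gen) + mu_nat
    filtered = (1.0 - k) * log_ms + k * target
    # Rebuild the complex spectrum with the original phase, then invert.
    spec = np.exp(filtered / 2.0) * np.exp(1j * phase)
    return np.fft.irfft(spec, n=n_fft, axis=0)[:n_frames]
```

In this sketch only the magnitude of the modulation spectrum is modified and the phase is kept, so the timing of the trajectory is preserved while its temporal fluctuation is restored. When the natural and generated statistics coincide, the filter reduces to the identity.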
