Postfilters to Modify the Modulation Spectrum for Statistical Parametric Speech Synthesis

This paper presents novel approaches based on modulation spectrum (MS) for high-quality statistical parametric speech synthesis, including text-to-speech (TTS) and voice conversion (VC). Although statistical parametric speech synthesis offers various advantages over concatenative speech synthesis, the synthetic speech quality is still not as good as that of concatenative speech synthesis or the quality of natural speech. One of the biggest issues causing the quality degradation is the over-smoothing effect often observed in the generated speech parameter trajectories. Global variance (GV) is known as a feature well correlated with the over-smoothing effect, and the effectiveness of keeping the GV of the generated speech parameter trajectories similar to those of natural speech has been confirmed. However, the quality gap between natural speech and synthetic speech is still large. In this paper, we propose using the MS of the generated speech parameter trajectories as a new feature to effectively quantify the over-smoothing effect. Moreover, we propose postfilters to modify the MS utterance by utterance or segment by segment to make the MS of synthetic speech close to that of natural speech. The proposed postfilters are applicable to various synthesizers based on statistical parametric speech synthesis. We first perform an evaluation of the proposed method in the framework of hidden Markov model (HMM)-based TTS, examining its properties from different perspectives. Furthermore, effectiveness of the proposed postfilters are also evaluated in Gaussian mixture model (GMM)-based VC and classification and regression trees (CART)-based TTS (a.k.a., CLUSTERGEN). The experimental results demonstrate that 1) the proposed utterance-level postfilter achieves quality comparable to the conventional generation algorithm considering the GV, and yields significant improvements by applying to the GV-based generation algorithm in HMM-based TTS, 2) the proposed segment-level postfilter capable of achieving low-delay synthesis also yields significant improvements in synthetic speech quality, and 3) the proposed postfilters are also effective in not only HMM-based TTS but also GMM-based VC and CLUSTERGEN.

[1]  Y. Sagisaka,et al.  Speech synthesis by rule using an optimal selection of non-uniform synthesis units , 1988, ICASSP-88., International Conference on Acoustics, Speech, and Signal Processing.

[2]  Shigeru Katagiri,et al.  A large-scale Japanese speech database , 1990, ICSLP.

[3]  R. Plomp,et al.  Effect of reducing slow temporal modulations on speech reception. , 1994, The Journal of the Acoustical Society of America.

[4]  R. Plomp,et al.  Effect of temporal envelope smearing on speech reception. , 1994, The Journal of the Acoustical Society of America.

[5]  K. Tokuda,et al.  Speech parameter generation from HMM using dynamic features , 1995, 1995 International Conference on Acoustics, Speech, and Signal Processing.

[6]  Joseph P. Olive,et al.  Text-to-speech synthesis , 1995, AT&T Technical Journal.

[7]  Alan W. Black,et al.  Unit selection in a concatenative speech synthesis system using a large speech database , 1996, 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing Conference Proceedings.

[8]  Misha Pavel,et al.  Intelligibility of speech with filtered time trajectories of spectral envelopes , 1996, Proceeding of Fourth International Conference on Spoken Language Processing. ICSLP '96.

[9]  Eric Moulines,et al.  Continuous probabilistic transform for voice conversion , 1998, IEEE Trans. Speech Audio Process..

[10]  Keiichi Tokuda,et al.  Simultaneous modeling of spectrum, pitch and duration in HMM-based speech synthesis , 1999, EUROSPEECH.

[11]  Hideki Kawahara,et al.  Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based F0 extraction: Possible role of a repetitive structure in sounds , 1999, Speech Commun..

[12]  Keiichi Tokuda,et al.  Speech parameter generation algorithms for HMM-based speech synthesis , 2000, 2000 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.00CH37100).

[13]  Koichi Shinoda,et al.  MDL-based context-dependent subword modeling for speech recognition , 2000 .

[14]  Hideki Kawahara,et al.  Aperiodicity extraction and control using mixed mode excitation and group delay manipulation for a high quality speech analysis, modification and synthesis system STRAIGHT , 2001, MAVEBA.

[15]  Les E. Atlas,et al.  EURASIP Journal on Applied Signal Processing 2003:7, 668–675 c ○ 2003 Hindawi Publishing Corporation Joint Acoustic and Modulation Frequency , 2003 .

[16]  Hervé Bourlard,et al.  Mel-cepstrum modulation spectrum (MCMS) features for robust ASR , 2003, 2003 IEEE Workshop on Automatic Speech Recognition and Understanding (IEEE Cat. No.03EX721).

[17]  Alan W. Black,et al.  CLUSTERGEN: a statistical parametric synthesizer using trajectory modeling , 2006, INTERSPEECH.

[18]  Tomoki Toda,et al.  Maximum likelihood voice conversion based on GMM with STRAIGHT mixed excitation , 2006, INTERSPEECH.

[19]  Takao Kobayashi,et al.  Average-Voice-Based Speech Synthesis Using HSMM-Based Speaker Adaptation and Adaptive Training , 2007, IEICE Trans. Inf. Syst..

[20]  Heiga Zen,et al.  Statistical Parametric Speech Synthesis , 2007, 2007 IEEE International Conference on Acoustics, Speech and Signal Processing - ICASSP '07.

[21]  Takashi Nose,et al.  A Style Control Technique for HMM-Based Expressive Speech Synthesis , 2007, IEICE Trans. Inf. Syst..

[22]  Keiichi Tokuda,et al.  A Speech Parameter Generation Algorithm Considering Global Variance for HMM-Based Speech Synthesis , 2007, IEICE Trans. Inf. Syst..

[23]  Heiga Zen,et al.  A Hidden Semi-Markov Model-Based Speech Synthesis System , 2007, IEICE Trans. Inf. Syst..

[24]  Tomoki Toda,et al.  Voice Conversion Based on Maximum-Likelihood Estimation of Spectral Parameter Trajectory , 2007, IEEE Transactions on Audio, Speech, and Language Processing.

[25]  Yannis Stylianou,et al.  Voice Transformation: A survey , 2009, 2009 IEEE International Conference on Acoustics, Speech and Signal Processing.

[26]  Hynek Hermansky,et al.  Phoneme recognition using spectral envelope and modulation frequency features , 2009, 2009 IEEE International Conference on Acoustics, Speech and Signal Processing.

[27]  Paul Taylor,et al.  Text-to-Speech Synthesis , 2009 .

[28]  S. King,et al.  The Blizzard Challenge 2011 , 2011 .

[29]  Tomoki Toda,et al.  Speaker-Adaptive Speech Synthesis Based on Eigenvoice Conversion and Language-Dependent Prosodic Conversion in Speech-to-Speech Translation , 2011, INTERSPEECH.

[30]  Kai Yu,et al.  Continuous F0 Modeling for HMM Based Statistical Parametric Speech Synthesis , 2011, IEEE Transactions on Audio, Speech, and Language Processing.

[31]  J. Tao,et al.  A State Duration Generation Algorithm Considering Global Variance for HMM-based Speech Synthesis , 2011 .

[32]  Mark J. F. Gales,et al.  Exploring Rich Expressive Information from Audiobook Data Using Cluster Adaptive Training , 2012, INTERSPEECH.

[33]  Tomoki Toda,et al.  Implementation of Computationally Efficient Real-Time Voice Conversion , 2012, INTERSPEECH.

[34]  Heiga Zen,et al.  Product of Experts for Statistical Parametric Speech Synthesis , 2012, IEEE Transactions on Audio, Speech, and Language Processing.

[35]  Tomoki Toda,et al.  Singing voice conversion method based on many-to-many eigenvoice conversion and training data generation using a singing-to-singing synthesis system , 2012, Proceedings of The 2012 Asia Pacific Signal and Information Processing Association Annual Summit and Conference.

[36]  Moncef Gabbouj,et al.  Voice Conversion Using Dynamic Kernel Partial Least Squares Regression , 2012, IEEE Transactions on Audio, Speech, and Language Processing.

[37]  S. King,et al.  Speech synthesis technologies for individuals with vocal disabilities: Voice banking and reconstruction , 2012 .

[38]  Tetsuya Takiguchi,et al.  Voice conversion in high-order eigen space using deep belief nets , 2013, INTERSPEECH.

[39]  H. Timothy Bunnell,et al.  Optimizations and fitting procedures for the liljencrants-fant model for statistical parametric speech synthesis , 2013, INTERSPEECH.

[40]  Alan W. Black,et al.  Text to speech in new languages without a standardized orthography , 2013, SSW.

[41]  Heiga Zen,et al.  Speech Synthesis Based on Hidden Markov Models , 2013, Proceedings of the IEEE.

[42]  Kou Tanaka,et al.  A hybrid approach to electrolaryngeal speech enhancement based on spectral subtraction and statistical voice conversion , 2013, INTERSPEECH.

[43]  William J. Byrne,et al.  Fast, low-artifact speech synthesis considering global variance , 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[44]  Haizhou Li,et al.  Synthetic speech detection using temporal modulation feature , 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[45]  Tomoki Toda,et al.  Regression approaches to perceptual age control in singing voice conversion , 2014, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[46]  Kai Yu,et al.  An investigation of implementation and performance analysis of DNN based speech synthesis system , 2014, 2014 12th International Conference on Signal Processing (ICSP).

[47]  Takashi Nose,et al.  Statistical Parametric Speech Synthesis Based on Gaussian Process Regression , 2014, IEEE Journal of Selected Topics in Signal Processing.

[48]  Tuomo Raitio,et al.  DNN-based stochastic postfilter for HMM-based speech synthesis , 2014, INTERSPEECH.

[49]  Takashi Nose,et al.  A Parameter Generation Algorithm Using Local Variance for HMM-Based Speech Synthesis , 2014, IEEE Journal of Selected Topics in Signal Processing.

[50]  Florian Eyben,et al.  A frequency-weighted post-filtering transform for compensation of the over-smoothing effect in HMM-based speech synthesis , 2014, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[51]  Tomoki Toda,et al.  Modulation spectrum-based post-filter for GMM-based Voice Conversion , 2014, Signal and Information Processing Association Annual Summit and Conference (APSIPA), 2014 Asia-Pacific.

[52]  Takashi Nose,et al.  Analysis of spectral enhancement using global variance in HMM-based speech synthesis , 2014, INTERSPEECH.

[53]  Heiga Zen,et al.  Deep mixture density networks for acoustic modeling in statistical parametric speech synthesis , 2014, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[54]  Kou Tanaka,et al.  An evaluation of excitation feature prediction in a hybrid approach to electrolaryngeal speech enhancement , 2014, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[55]  Tomoki Toda,et al.  Parameter Generation Methods With Rich Context Models for High-Quality and Flexible Text-To-Speech Synthesis , 2014, IEEE Journal of Selected Topics in Signal Processing.

[56]  Tomoki Toda,et al.  Modified post-filter to recover modulation spectrum for HMM-based speech synthesis , 2014, 2014 IEEE Global Conference on Signal and Information Processing (GlobalSIP).

[57]  Alan W. Black,et al.  Automatic discovery of a phonetic inventory for unwritten languages for statistical speech synthesis , 2014, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[58]  Tomoki Toda,et al.  A postfilter to modify the modulation spectrum in HMM-based speech synthesis , 2014, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[59]  Yannis Agiomyrgiannakis,et al.  Vocaine the vocoder and applications in speech synthesis , 2015, 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[60]  Heiga Zen,et al.  Unidirectional long short-term memory recurrent neural network with recurrent output layer for low-latency speech synthesis , 2015, 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[61]  Rainer Martin,et al.  Reduction of reverberation effects in the MFCC modulation spectrum for improved classification of acoustic signals , 2015, INTERSPEECH.

[62]  Heiga Zen,et al.  Deep Learning for Acoustic Modeling in Parametric Speech Generation: A systematic review of existing techniques and future trends , 2015, IEEE Signal Processing Magazine.

[63]  Yoshihiko Nankaku,et al.  The effect of neural networks in statistical parametric speech synthesis , 2015, 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).