DNN-based stochastic postfilter for HMM-based speech synthesis

In this paper we propose a deep neural network to model the conditional probability of the spectral differences between natural and synthetic speech. This allows us to reconstruct the spectral fine structure in speech generated by HMMs. We compare the new stochastic, data-driven postfilter with global-variance-based parameter generation and modulation-spectrum enhancement. Our results confirm that the proposed method significantly improves the segmental quality of synthetic speech compared with these conventional methods.

Index Terms: HMM, speech synthesis, DNN, modulation spectrum, postfilter, segmental quality
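The core idea can be sketched in miniature: learn a mapping from synthetic spectral frames to the natural-minus-synthetic residual, then add the predicted residual back at synthesis time. The sketch below is an illustrative stand-in, not the paper's architecture (the paper models a conditional probability with a deep network; here a plain one-hidden-layer MLP trained with MSE is used instead), and all dimensions and the toy "over-smoothed" data are hypothetical.

```python
import numpy as np

# Hedged sketch: a small feed-forward net predicts the natural-minus-
# synthetic spectral residual from synthetic mel-cepstral frames; at
# synthesis time the residual is added back as a postfilter.
# DIM, HIDDEN, N and the toy data are assumptions for illustration only.

rng = np.random.default_rng(0)
DIM, HIDDEN, N = 24, 64, 512

# Toy stand-in data: "synthetic" frames are over-smoothed naturals.
natural = rng.normal(size=(N, DIM))
synthetic = 0.6 * natural
target_diff = natural - synthetic          # residual the network learns

# One-hidden-layer MLP, trained by gradient descent on mean-squared error.
W1 = rng.normal(scale=0.1, size=(DIM, HIDDEN)); b1 = np.zeros(HIDDEN)
W2 = rng.normal(scale=0.1, size=(HIDDEN, DIM)); b2 = np.zeros(DIM)

lr = 0.05
for _ in range(500):
    h = np.tanh(synthetic @ W1 + b1)
    pred = h @ W2 + b2
    err = pred - target_diff               # gradient of 0.5 * MSE
    gW2 = h.T @ err / N
    gb2 = err.mean(axis=0)
    dh = (err @ W2.T) * (1.0 - h ** 2)     # backprop through tanh
    gW1 = synthetic.T @ dh / N
    gb1 = dh.mean(axis=0)
    W1 -= lr * gW1; b1 -= lr * gb1
    W2 -= lr * gW2; b2 -= lr * gb2

def postfilter(frames):
    """Add the predicted spectral residual back to synthetic frames."""
    h = np.tanh(frames @ W1 + b1)
    return frames + h @ W2 + b2

enhanced = postfilter(synthetic)
mse_before = np.mean((synthetic - natural) ** 2)
mse_after = np.mean((enhanced - natural) ** 2)
```

After training, the postfiltered frames sit closer to the natural ones than the raw synthetic frames do, which is the effect the paper measures with listening tests rather than MSE.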
