Modulation spectrum-based post-filter for GMM-based Voice Conversion

This paper addresses the over-smoothing effect in Gaussian Mixture Model (GMM)-based Voice Conversion (VC). The flexibility of the statistical approach is one of the major reasons why it is widely applied in speech-based systems. However, quality degradation caused by over-smoothed converted speech parameters is an unavoidable problem of statistical modeling. One common approach to this over-smoothing in the conversion step is to compensate features of the generated parameters, such as Global Variance (GV), that explicitly express the over-smoothing effect. In statistical Text-To-Speech (TTS) synthesis, we recently introduced the Modulation Spectrum (MS), an extended form of GV, and proposed an MS-based Post-Filter (MSPF) for Hidden Markov Model (HMM)-based TTS synthesis. In this paper, we apply the MSPF to GMM-based VC. Because the MS of speech parameters is degraded through the GMM-based conversion process, we apply a post-filter that modifies the MS of the converted parameters. The experimental evaluation shows quality benefits from the proposed post-filter.
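
As a rough illustration of the idea in the abstract, the sketch below post-filters a single parameter trajectory in the modulation-spectrum domain. It is not the implementation evaluated in the paper. It assumes that the MS is the log power spectrum of the trajectory, that the filter normalizes the converted MS with converted-speech statistics and rescales it with natural-speech statistics, and that an emphasis coefficient `k` interpolates between the unfiltered and fully filtered MS. The function names, the FFT length, and the synthetic data (moving-average smoothing standing in for GMM conversion) are all hypothetical.

```python
import numpy as np

def modulation_spectrum(seq, n_fft):
    """Return (log power spectrum, phase) of a 1-D trajectory, zero-padded to n_fft."""
    spec = np.fft.rfft(seq, n=n_fft)
    log_power = np.log(np.abs(spec) ** 2 + 1e-12)  # small floor avoids log(0)
    return log_power, np.angle(spec)

def ms_postfilter(converted, mu_nat, sd_nat, mu_conv, sd_conv, n_fft, k=1.0):
    """Move the MS of `converted` toward natural-speech statistics (assumed form).

    k = 0 leaves the trajectory unchanged; k = 1 applies the full correction.
    The phase of the converted trajectory is kept as-is.
    """
    n_frames = len(converted)
    ms, phase = modulation_spectrum(converted, n_fft)
    # Normalize with converted-speech statistics, rescale with natural ones.
    ms_target = (ms - mu_conv) / sd_conv * sd_nat + mu_nat
    ms_new = (1.0 - k) * ms + k * ms_target
    # Back to the time domain: magnitude from the new MS, original phase.
    spec_new = np.exp(ms_new / 2.0) * np.exp(1j * phase)
    return np.fft.irfft(spec_new, n=n_fft)[:n_frames]

# Toy data: "natural" trajectories are white noise, "converted" ones are
# over-smoothed copies (a stand-in for GMM conversion, not real VC output).
rng = np.random.default_rng(0)
n_fft, n_frames = 64, 50
natural = rng.standard_normal((20, n_frames))
kernel = np.ones(5) / 5.0
converted = np.array([np.convolve(x, kernel, mode="same") for x in natural])

# Per-bin MS statistics, estimated separately for natural and converted data.
ms_nat = np.array([modulation_spectrum(x, n_fft)[0] for x in natural])
ms_conv = np.array([modulation_spectrum(x, n_fft)[0] for x in converted])
mu_nat, sd_nat = ms_nat.mean(axis=0), ms_nat.std(axis=0)
mu_conv, sd_conv = ms_conv.mean(axis=0), ms_conv.std(axis=0)

filtered = ms_postfilter(converted[0], mu_nat, sd_nat, mu_conv, sd_conv, n_fft)
```

In practice the statistics would be estimated per feature dimension from training data, and utterances longer than the FFT length would need segment-wise processing; both are left out of this sketch.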
