论文信息 - Modulation spectrum-based post-filter for GMM-based Voice Conversion

Modulation spectrum-based post-filter for GMM-based Voice Conversion

This paper addresses an over-smoothing effect in Gaussian Mixture Model (GMM)-based Voice Conversion (VC). The flexible use of the statistical approach is one of the major reason why this approach is widely applied to the speech-based systems. However, quality degradation by over-smoothed speech parameter converted is unavoidable problem of statistical modeling. One of common approaches to this over-smoothness in conversion step is to compensate generated features, such as Global Variance (GV), that explicitly express the over-smoothing effect. In statistical Text-To-Speech (TTS) synthesis, we have recently introduced a Modulation Spectrum (MS) which is an extended form of GV, and have proposed MS-based Post-Filter (MSPF) in Hidden Markov Model (HMM)-based TTS synthesis. In this paper, we apply the MSPF to GMM-based VC. Because the MS of speech parameters is degraded through GMM-based conversion process, we perform the post-filter due to MS modification of converted parameters. The experimental evaluation yields the quality benefits by the proposed post-filter.

[1] Hideki Kawahara,et al. Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based F0 extraction: Possible role of a repetitive structure in sounds , 1999, Speech Commun..

[2] Keiichi Tokuda,et al. Incorporating a mixed excitation model and postfilter into HMM-based text-to-speech synthesis , 2005 .

[3] Hynek Hermansky,et al. Phoneme recognition using spectral envelope and modulation frequency features , 2009, 2009 IEEE International Conference on Acoustics, Speech and Signal Processing.

[4] Tomoki Toda,et al. Regression approaches to perceptual age control in singing voice conversion , 2014, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[5] Tomoki Toda,et al. Maximum likelihood voice conversion based on GMM with STRAIGHT mixed excitation , 2006, INTERSPEECH.

[6] Keiichi Tokuda,et al. Incorporating a mixed excitation model and postfilter into HMM-based text-to-speech synthesis , 2005, Systems and Computers in Japan.

[7] Tomoki Toda,et al. A postfilter to modify the modulation spectrum in HMM-based speech synthesis , 2014, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[8] Yishan Jiao,et al. Improving voice quality of HMM-based speech synthesis using voice conversion method , 2014, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[9] Shigeru Katagiri,et al. A large-scale Japanese speech database , 1990, ICSLP.

[10] Tomoki Toda,et al. Implementation of Computationally Efficient Real-Time Voice Conversion , 2012, INTERSPEECH.

[11] Hideki Kawahara,et al. Aperiodicity extraction and control using mixed mode excitation and group delay manipulation for a high quality speech analysis, modification and synthesis system STRAIGHT , 2001, MAVEBA.

[12] Florian Eyben,et al. A frequency-weighted post-filtering transform for compensation of the over-smoothing effect in HMM-based speech synthesis , 2014, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[13] Yu Tsao,et al. Incorporating global variance in the training phase of GMM-based voice conversion , 2013, 2013 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference.

[14] Tomoki Toda,et al. Voice Conversion Based on Maximum-Likelihood Estimation of Spectral Parameter Trajectory , 2007, IEEE Transactions on Audio, Speech, and Language Processing.

[15] Tomoki Toda,et al. Speaker-Adaptive Speech Synthesis Based on Eigenvoice Conversion and Language-Dependent Prosodic Conversion in Speech-to-Speech Translation , 2011, INTERSPEECH.

[16] Keiichi Tokuda,et al. Speech parameter generation algorithms for HMM-based speech synthesis , 2000, 2000 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.00CH37100).

[17] R. Plomp,et al. Effect of reducing slow temporal modulations on speech reception. , 1994, The Journal of the Acoustical Society of America.

[18] Ricardo Gutierrez-Osuna,et al. Can voice conversion be used to reduce non-native accents? , 2014, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[19] Kou Tanaka,et al. An evaluation of excitation feature prediction in a hybrid approach to electrolaryngeal speech enhancement , 2014, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).