Modulation spectrum-constrained trajectory training algorithm for GMM-based Voice Conversion

This paper presents a novel training algorithm for Gaussian Mixture Model (GMM)-based Voice Conversion (VC). One advantage of GMM-based VC is its computationally efficient conversion processing, which makes real-time VC applications feasible. On the other hand, the quality of the converted speech remains significantly worse than that of natural speech. To address this problem while preserving computationally efficient conversion, the proposed training method 1) uses a consistent optimization criterion between training and conversion, and 2) compensates the Modulation Spectrum (MS) of the converted parameter trajectory, a feature that correlates sensitively with the over-smoothing effects responsible for the quality degradation of converted speech. The experimental results demonstrate that the proposed algorithm yields significant improvements over the basic training algorithm in terms of both converted speech quality and conversion accuracy for speaker individuality.
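As a rough illustration of the quantity being compensated, the sketch below computes one common definition of a modulation spectrum: the log power spectrum of each feature dimension's temporal trajectory. This is a minimal NumPy sketch under that assumed definition; the paper's exact formulation (window length, normalization, which dimensions are pooled) may differ.

```python
import numpy as np

def modulation_spectrum(trajectory, fft_len=64):
    """Log power spectrum of each feature dimension's temporal sequence.

    trajectory: (T, D) array, e.g. mel-cepstral coefficients over time.
    Returns an (fft_len // 2 + 1, D) array of log modulation-spectrum power.
    """
    traj = np.asarray(trajectory, dtype=float)
    # Remove the per-dimension mean so the spectrum reflects the temporal
    # fluctuation of the trajectory rather than its static offset.
    traj = traj - traj.mean(axis=0, keepdims=True)
    spec = np.fft.rfft(traj, n=fft_len, axis=0)
    return np.log(np.abs(spec) ** 2 + 1e-10)  # floor avoids log(0)

# Over-smoothing attenuates the trajectory's fluctuations, which appears as
# uniformly reduced modulation-spectrum power (a crude stand-in below: the
# "over-smoothed" trajectory is just a scaled-down copy of the natural one).
rng = np.random.default_rng(0)
natural = rng.standard_normal((100, 1))
oversmoothed = 0.3 * natural
assert (modulation_spectrum(oversmoothed) <= modulation_spectrum(natural)).all()
```

Compensating the MS during trajectory training pushes the converted trajectory's fluctuation statistics back toward those of natural speech, counteracting the over-smoothing that statistical mapping tends to introduce.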
