Articulatory controllable speech modification based on Gaussian mixture models with direct waveform modification using spectrum differential

In our previous work, we developed a speech modification system capable of manipulating unobserved articulatory movements by sequentially performing speech-to-articulatory inversion mapping and articulatory-to-speech production mapping, both based on a Gaussian mixture model (GMM)-based statistical feature mapping technique. One of the biggest issues in this system is the quality degradation of the synthetic speech caused by modeling and conversion errors in the vocoder-based waveform generation framework. To address this issue, we propose several implementation methods of direct waveform modification. The proposed methods directly filter the input speech waveform with a time sequence of spectral differential parameters, calculated between the unmodified and modified spectral envelope parameters, in order to avoid vocoder-based excitation signal generation. The experimental results show that the proposed direct waveform modification methods yield significantly larger improvements in the quality of the synthetic speech while retaining the capability to intuitively modify phoneme sounds by manipulating the unobserved articulatory movements.
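The idea of filtering the input waveform with a spectrum differential can be illustrated with a minimal sketch. It is not the implementation evaluated in this work: it assumes the unmodified and modified spectral envelopes are already available as per-frame log-amplitude spectra, and it swaps in a simple zero-phase STFT-domain gain with overlap-add in place of whatever synthesis filter the system actually uses. The function names `num_frames` and `apply_spectrum_differential`, the parameters `frame_len` and `hop`, and the array layout are all hypothetical.

```python
import numpy as np


def num_frames(n_samples, hop=128):
    """Number of analysis frames needed to cover n_samples (assumed framing)."""
    return n_samples // hop + 1


def apply_spectrum_differential(wave, log_env_orig, log_env_mod,
                                frame_len=512, hop=128):
    """Filter `wave` with the per-frame differential of two log envelopes.

    log_env_orig, log_env_mod: arrays of shape (n_frames, frame_len // 2 + 1)
    holding log-amplitude spectral envelopes (an assumed representation).
    The input waveform itself supplies the excitation, so no vocoder-style
    excitation signal is generated.
    """
    # Spectrum differential: modified minus unmodified envelope, per frame.
    diff = np.asarray(log_env_mod, dtype=float) - np.asarray(log_env_orig, dtype=float)

    window = np.hanning(frame_len)
    padded = np.concatenate([np.asarray(wave, dtype=float), np.zeros(frame_len)])
    out = np.zeros_like(padded)
    norm = np.zeros_like(padded)

    for t in range(diff.shape[0]):
        start = t * hop
        if start + frame_len > len(padded):
            break
        frame = padded[start:start + frame_len] * window
        # Apply the differential as a real-valued (zero-phase) gain per bin.
        spec = np.fft.rfft(frame) * np.exp(diff[t])
        out[start:start + frame_len] += np.fft.irfft(spec, n=frame_len) * window
        norm[start:start + frame_len] += window ** 2

    # Undo the squared-window weighting of the overlap-add; skip near-zero
    # weights, which only occur at the very edges of the signal.
    norm = np.where(norm > 1e-8, norm, 1.0)
    return (out / norm)[:len(wave)]
```

With a zero differential the sketch returns the input waveform unchanged away from the signal edges, which reflects the motivation stated above: frames whose envelope is not modified pass through without the errors of vocoder-based resynthesis.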
