Paper information - Study of the Effect of Reducing Training Data in Speech Synthesis Adaptation Based on Frequency Warping

Speaker adaptation techniques use a small amount of data to modify Hidden Markov Model (HMM) based speech synthesis systems so that they mimic a target voice. These techniques can be used to provide personalized systems to people who suffer from a speech impairment, allowing them to communicate in a more natural way. Although adaptation techniques do not require a large amount of data, the recording process can be tedious if the user has difficulty speaking. An important factor in improving the acceptance of these systems is therefore the ability to obtain acceptable results with a minimal amount of recordings. In this work we explore how the performance of an adaptation method based on frequency warping, which uses only vocalic segments, varies with the amount of available training data.
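To make the idea of frequency warping concrete, the sketch below applies a generic bilinear (all-pass) warping to a spectral envelope. It is an illustration only and is not taken from the paper: the abstract does not specify the warping function, so the bilinear form, the factor `alpha`, and the function names `bilinear_warp` and `warp_envelope` are all assumptions made here.

```python
import math


def bilinear_warp(omega, alpha):
    """Map a frequency omega in [0, pi] through a bilinear (all-pass) warping.

    alpha in (-1, 1) is a hypothetical warping factor; alpha = 0 is the identity.
    """
    return omega + 2.0 * math.atan2(alpha * math.sin(omega),
                                    1.0 - alpha * math.cos(omega))


def warp_envelope(envelope, alpha):
    """Warp a spectral envelope sampled uniformly on [0, pi].

    Each output bin reads the source envelope at the inverse-warped frequency.
    The inverse of a bilinear warping with factor alpha is the same warping
    with factor -alpha. Linear interpolation is used between source bins.
    """
    n = len(envelope)
    warped = []
    for k in range(n):
        omega_out = math.pi * k / (n - 1)
        omega_src = bilinear_warp(omega_out, -alpha)
        # Convert the source frequency to a fractional bin index and clamp it.
        pos = min(max(omega_src / math.pi * (n - 1), 0.0), float(n - 1))
        lo = int(math.floor(pos))
        hi = min(lo + 1, n - 1)
        frac = pos - lo
        warped.append((1.0 - frac) * envelope[lo] + frac * envelope[hi])
    return warped


# Toy envelope with a single formant-like peak at bin 64 of 257.
source = [math.exp(-((k - 64) / 8.0) ** 2) for k in range(257)]
shifted = warp_envelope(source, alpha=0.1)
peak_before = max(range(len(source)), key=source.__getitem__)
peak_after = max(range(len(shifted)), key=shifted.__getitem__)
print(peak_before, peak_after)
```

With a positive `alpha` the peak moves toward higher frequencies, and with a negative `alpha` toward lower ones. In an adaptation setting the warping factor would be estimated from the target speaker's recordings; how that estimation is done from vocalic segments is not described in the abstract and is not attempted here.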

Inma Hernáez | Eva Navas | Daniel Erro | Agustín Alonso
