COMBINING VOCAL TRACT LENGTH NORMALIZATION WITH LINEAR TRANSFORMATIONS IN A BAYESIAN FRAMEWORK

Recent research has demonstrated the effectiveness of vocal tract length normalization (VTLN) as a rapid adaptation technique for statistical parametric speech synthesis. VTLN produces more natural-sounding speech than MLLR-based adaptation techniques, with quality much closer to that of the original average voice model. However, with just a single parameter, VTLN captures very few speaker-specific characteristics compared to the available linear-transform-based adaptation techniques. This paper proposes that the merits of VTLN can be combined with those of linear-transform-based adaptation techniques in a Bayesian framework, where VTLN is used as the prior information. A novel technique for propagating the gender information from the VTLN prior through constrained structural maximum a posteriori linear regression (CSMAPLR) adaptation is presented. Experiments show that the resulting transformation improves speech quality, with better naturalness and intelligibility, and improves speaker similarity.
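
The idea of using VTLN as prior information can be sketched as a maximum a posteriori (MAP) estimate of the linear transform. The symbols below are illustrative assumptions rather than the paper's own notation: O is the adaptation data, \lambda the average voice model, W the linear transform to be estimated, W_{\mathrm{VTLN}}(\alpha) the transform implied by the single warping parameter \alpha, and \Omega, \Psi the prior hyperparameters. The matrix-variate normal form of the prior is one common choice, assumed here for illustration.

```latex
% MAP estimate of the adaptation transform (sketch, assumed notation)
\hat{W} = \arg\max_{W} \; p(O \mid \lambda, W)\, p(W)

% Assumed prior: matrix-variate normal centred on the VTLN transform
p(W) \propto \exp\!\left\{ -\tfrac{1}{2}\,
  \operatorname{tr}\!\left[ \bigl(W - W_{\mathrm{VTLN}}(\alpha)\bigr)^{\top}
  \Omega^{-1} \bigl(W - W_{\mathrm{VTLN}}(\alpha)\bigr) \Psi^{-1} \right] \right\}
```

Under this sketch, with little adaptation data the prior dominates and \hat{W} stays close to the VTLN transform. With more data the likelihood term dominates and \hat{W} can capture further speaker-specific characteristics.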
