Creating synthetic voices for children by adapting adult average voice using stacked transformations and VTLN

This paper describes experiments in creating personalised children's voices for HMM-based synthesis by adapting either an adult or a child average voice. The adult average voice is trained on a large adult speech database, whereas the child average voice is trained on a small database of children's speech. We propose using stacked transformations to create synthetic child voices: a child average voice is first derived from the adult average voice through speaker adaptation using the pooled speech data from multiple children, and child-specific speaker adaptation is then applied on top of it. Vocal tract length normalisation (VTLN) is also applied to speech synthesis to see whether it helps speaker adaptation when only a small amount of adaptation data is available. The listening test results show that the stacked transformations significantly improve speaker adaptation for small amounts of data, but the additional benefit provided by VTLN is not yet clear.
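To make the idea of stacking concrete, the following is a minimal sketch and not the paper's implementation. It assumes each adaptation step can be written as a single global affine feature transform x' = Ax + b. The abstract does not state the transform type or the number of regression classes, so this form, the 2-dimensional toy vectors, and all numeric values are hypothetical. The sketch shows that applying an "adult average to child average" transform followed by a "child average to specific child" transform is equivalent to one composed affine transform. It also includes the all-pass bilinear frequency-warping function commonly used for VTLN with mel-cepstral features; the function names and the warping factor `alpha` are illustrative choices.

```python
import math


def apply_affine(A, b, x):
    """Return A x + b for a square matrix A (list of rows) and vectors b, x."""
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) + b_i
            for row, b_i in zip(A, b)]


def compose_affine(A2, b2, A1, b1):
    """Return (A, b) such that A x + b == A2 (A1 x + b1) + b2."""
    n = len(A1)
    A = [[sum(A2[i][k] * A1[k][j] for k in range(n)) for j in range(n)]
         for i in range(n)]
    b = apply_affine(A2, b2, b1)  # A2 b1 + b2
    return A, b


def bilinear_warp(omega, alpha):
    """Warp a normalised frequency omega in [0, pi] with the bilinear
    (first-order all-pass) function; alpha = 0 leaves omega unchanged."""
    return omega + 2.0 * math.atan2(alpha * math.sin(omega),
                                    1.0 - alpha * math.cos(omega))


# Hypothetical toy transforms (assumed values, not from the paper).
A_avg, b_avg = [[0.9, 0.1], [0.0, 1.1]], [0.5, -0.2]    # adult avg -> child avg
A_spk, b_spk = [[1.05, 0.0], [0.02, 0.95]], [-0.1, 0.3]  # child avg -> one child

x = [1.0, 2.0]  # a toy feature vector in the adult-average space

# Stacked application: first the pooled-children transform, then the
# child-specific transform on top of it.
stacked = apply_affine(A_spk, b_spk, apply_affine(A_avg, b_avg, x))

# Equivalent single transform obtained by composing the two.
A_tot, b_tot = compose_affine(A_spk, b_spk, A_avg, b_avg)
composed = apply_affine(A_tot, b_tot, x)

print(stacked, composed)
print(bilinear_warp(math.pi / 2, 0.1))
```

In this simplified view, the first transform can be estimated from the pooled data of all children, so the second only has to capture the remaining difference between the child average and one child. That is one plausible reading of why stacking helps when each child contributes only a small amount of adaptation data.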
