MIMIC: a voice-adaptive phonetic-tree speech synthesiser

This paper presents Mimic, a decision-tree-based, concatenative, voice-adaptive text-to-speech synthesiser. Mimic integrates text-to-speech synthesis (TTS) with speech recognition and speaker adaptation. Speech is synthesised by concatenating triphone synthesis units. The triphone units are obtained from clusters of training examples that are modelled, labelled and segmented using clustered HMMs and Viterbi segmentation. The prosodic structure of the pitch, duration and energy contours is captured using statistical training methods. The concept of a decision-tree-based statistical micro-prosody model is introduced as a hierarchical method of modelling prosodic parameters. The voice adaptation component adapts the spectral parameters as well as pitch, duration and energy.
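
To make the notion of a triphone synthesis unit concrete, the following minimal sketch expands a phone sequence into context-dependent triphone labels. The `left-centre+right` label format, the `sil` boundary symbol and the function name `to_triphones` are assumptions chosen for illustration; the abstract does not specify how Mimic labels its units.

```python
from typing import List


def to_triphones(phones: List[str], boundary: str = "sil") -> List[str]:
    """Expand a phone sequence into left-centre+right triphone labels.

    The label format and the boundary symbol are illustrative assumptions,
    not taken from the Mimic paper.
    """
    # Pad both ends so the first and last phones also have a context.
    padded = [boundary] + phones + [boundary]
    return [
        f"{padded[i - 1]}-{padded[i]}+{padded[i + 1]}"
        for i in range(1, len(padded) - 1)
    ]


if __name__ == "__main__":
    # Phones for the word "cat" in a hypothetical phone set.
    print(to_triphones(["k", "ae", "t"]))
```

Each label names one unit to be selected from a cluster of training examples and concatenated; in a decision-tree-based system, contexts that were never seen in training would typically be mapped to a cluster that shares their phonetic properties.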
