Preliminary experiments toward automatic generation of new TTS voices from recorded speech alone

Generating a new concatenative text-to-speech (TTS) voice from recordings of a human voice requires not only the recordings but also additional information such as transcriptions, prosodic labels, and phonemic alignments. Because some of this information depends on the speaking style of the narrator, it must be added manually by listening to the recordings, which is costly and time-consuming. To tackle this problem, we have been working on a totally trainable TTS system in which every component, including the text processing module, can be trained automatically from a speech corpus. In this paper, we refine the framework and propose several submodules that collect all of the linguistic and acoustic information necessary for generating a TTS voice from the recorded speech. Although completely automatic generation of a new voice is not yet possible, we report progress on the submodules and present experimental results.
