Improving the Accuracy of the Speech Synthesis Based Phonetic Alignment Using Multiple Acoustic Features

Phonetic alignment of spoken utterances for speech research is commonly performed by HMM-based speech recognizers operating in forced alignment mode, but training the phonetic segment models requires considerable amounts of annotated data. When no such material is available, a possible solution is to synthesize the same phonetic sequence and align the resulting speech signal with the spoken utterance. However, without a careful choice of the acoustic features used in this procedure, it can perform poorly on continuous speech. In this paper we propose a new method for selecting the best features to use in the alignment procedure for each pair of phonetic segment classes. The results show that this selection considerably reduces segment boundary location errors.
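
The abstract does not include code, but the core operation it describes is a dynamic time warping (DTW) alignment between frame-level acoustic feature vectors of the synthesized and natural utterances: since the synthesizer knows the phone boundaries of its own output, the warping path transfers those boundaries onto the recording. The sketch below is a minimal illustration under those assumptions; the name align_frames and the use of a single Euclidean distance over one fixed feature set are hypothetical, whereas the paper's method would instead choose the feature set per pair of phonetic segment classes.

```python
import numpy as np

def align_frames(synth_feats: np.ndarray, natural_feats: np.ndarray):
    """DTW over per-frame feature vectors (a minimal sketch, not the paper's code).

    synth_feats:   (T1, D) feature matrix of the synthesized utterance
    natural_feats: (T2, D) feature matrix of the spoken utterance
    Returns the optimal warping path as (synth_frame, natural_frame) pairs.
    """
    t1, t2 = len(synth_feats), len(natural_feats)

    # Local distances: Euclidean distance between every pair of frames.
    local = np.linalg.norm(
        synth_feats[:, None, :] - natural_feats[None, :, :], axis=-1
    )

    # Accumulated cost with the standard symmetric step pattern
    # (vertical, horizontal, and diagonal predecessors).
    acc = np.full((t1, t2), np.inf)
    acc[0, 0] = local[0, 0]
    for i in range(t1):
        for j in range(t2):
            if i == 0 and j == 0:
                continue
            best_prev = min(
                acc[i - 1, j] if i > 0 else np.inf,
                acc[i, j - 1] if j > 0 else np.inf,
                acc[i - 1, j - 1] if i > 0 and j > 0 else np.inf,
            )
            acc[i, j] = local[i, j] + best_prev

    # Backtrack from the end to recover the optimal warping path.
    i, j = t1 - 1, t2 - 1
    path = [(i, j)]
    while (i, j) != (0, 0):
        candidates = []
        if i > 0:
            candidates.append((acc[i - 1, j], (i - 1, j)))
        if j > 0:
            candidates.append((acc[i, j - 1], (i, j - 1)))
        if i > 0 and j > 0:
            candidates.append((acc[i - 1, j - 1], (i - 1, j - 1)))
        _, (i, j) = min(candidates, key=lambda c: c[0])
        path.append((i, j))
    path.reverse()
    return path
```

Given the path, each phone boundary frame known from the synthesized signal maps to the natural frame it is warped against, which yields the boundary times for the spoken utterance. The paper's per-class feature selection would then amount to computing the local distance near each boundary from a different feature subset, chosen according to the classes of the two adjacent segments.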
