DTW-based phonetic alignment using multiple acoustic features

This paper presents our work on improving the accuracy of a DTW-based automatic phonetic aligner. The model assumes that the sequence of phonetic segments is already known, so the only goal is to align the spoken utterance with a synthetic reference signal produced by waveform concatenation without prosodic modification. Instead of using a single acoustic measure to compute the alignment cost function, our strategy combines several acoustic features, with the combination chosen according to the pair of phonetic segment classes being aligned. The results show that this strategy considerably reduces segment boundary location errors, even when the synthetic and natural speech signals come from speakers of different genders.
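The idea can be illustrated with a minimal sketch, which is not the authors' implementation. It runs a textbook DTW between reference and utterance frames, with a local cost that is a weighted sum of per-feature distances and weights looked up by class pair. The feature names (`energy`, `zcr`, `spectrum`), the weight table, and the assumption that each reference frame carries a class-pair label are all hypothetical; the abstract does not specify which features or weights are used.

```python
import math

# Hypothetical table: which features (and weights) to use for each class pair.
WEIGHTS = {
    ("vowel", "fricative"): {"energy": 1.0, "zcr": 1.0},
    ("vowel", "vowel"): {"spectrum": 1.0},
}
DEFAULT_WEIGHTS = {"spectrum": 1.0}  # assumed fallback for unlisted pairs


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def local_cost(ref_frame, utt_frame, class_pair):
    """Weighted sum of per-feature distances, chosen by class pair."""
    weights = WEIGHTS.get(class_pair, DEFAULT_WEIGHTS)
    return sum(w * euclidean(ref_frame[name], utt_frame[name])
               for name, w in weights.items())


def dtw_path(ref, utt, ref_pairs):
    """Return the DTW warping path as a list of (ref_index, utt_index)."""
    n, m = len(ref), len(utt)
    inf = float("inf")
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = local_cost(ref[i - 1], utt[j - 1], ref_pairs[i - 1])
            acc[i][j] = cost + min(acc[i - 1][j - 1], acc[i - 1][j], acc[i][j - 1])
    # Backtrack from the end, always moving to the cheapest predecessor.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min((acc[i - 1][j - 1], i - 1, j - 1),
                      (acc[i - 1][j], i - 1, j),
                      (acc[i][j - 1], i, j - 1))
    path.reverse()
    return path


def map_boundary(path, ref_boundary):
    """First utterance frame aligned to the given reference frame index."""
    for i, j in path:
        if i >= ref_boundary:
            return j
    return path[-1][1]


# Toy data: a boundary after 2 reference frames and after 3 utterance frames.
low = {"energy": [0.0], "zcr": [0.0]}
high = {"energy": [1.0], "zcr": [1.0]}
ref = [low, low, high, high]
utt = [low, low, low, high, high, high]
pairs = [("vowel", "fricative")] * len(ref)
path = dtw_path(ref, utt, pairs)
boundary = map_boundary(path, 2)
```

Because the boundaries of the synthetic reference are known from the concatenated units, mapping them through the warping path gives boundary estimates for the natural utterance. In the toy data above, the reference boundary at frame 2 is carried to the matching change point in the utterance.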
