AT&T Labs’ Natural Voices multilingual TTS system currently produces high-quality synthetic speech using a large-scale speech corpus [1]. Automatic segmentation is a major component technology in the development of such systems, and the prevalent approach to it in speech synthesis is based on Hidden Markov Models (HMMs). Although the HMM-based approach is the most automatic and reliable, it still has several limitations: mismatches between hand-labeled transcriptions and HMM alignment labels, which can lead to discontinuities in the synthetic speech, and the need for hand-labeled bootstrap data to initialize the HMMs. This paper introduces a new approach to automatic segmentation that combines a conventional HMM-based approach with spectral boundary correction. It aims both to minimize human intervention and to achieve higher segmental quality in unit-concatenative speech synthesis. A preference test shows that the proposed method is effective in reducing discontinuities in synthetic speech.
[1] Biing-Hwang Juang, et al. Fundamentals of speech recognition. Prentice Hall signal processing series, 1993.
[2] Andrej Ljolje, et al. Automatic speech segmentation for concatenative inventory selection. SSW, 1994.
[3] Isabel Trancoso, et al. Automatic Segment Alignment for Concatenative Speech Synthesis in Portuguese. 2001.
[4] Richard Sproat, et al. High-accuracy automatic segmentation. EUROSPEECH, 1999.
[5] Marc C. Beutnagel, et al. The AT&T Next-Gen TTS system. 1999.
[6] Yannis Stylianou, et al. Exploration of acoustic correlates in speaker selection for concatenative synthesis. ICSLP, 1998.