Automatic segmentation combining an HMM-based approach and spectral boundary correction

Currently, AT&T Labs’ Natural Voices multilingual TTS system produces high-quality synthetic speech with a largescale speech corpus [1]. In the development of such systems, automatic segmentation constitutes a major component technology. The prevalent approach for automatic segmentation in speech synthesis is Hidden Markov Model (HMM) based. Even though an HMM-based approach is the most automatic and reliable, there are still several limitations, such as mismatches between hand-labeled transcriptions and HMM alignment labels which can lead to discontinuities in the synthetic speech, or the need for hand-labeled bootstrap data in HMM initialization. This paper introduces a new approach to automatic segmentation which aims both to minimize human intervention and to achieve a higher segmental quality of synthetic speech in unit-concatenative speech synthesis, by combining a conventional HMM-based approach and spectral boundary correction. A preference test demonstrates the proposed method is effective in reducing discontinuities in synthetic speech.