We describe techniques used in the development of an automatic annotation system for a concatenative text-to-speech synthesis system. The goal of the system is to generate annotations automatically from word-level transcriptions, such that the resulting synthetic speech is of the same quality as that produced from hand-labelled speech. Our approach has been to apply the standard technique of “forced alignment” to each utterance and to refine both acoustic and pronunciation modelling to achieve greater alignment accuracy. Acoustic models were improved by Bayesian speaker adaptation and by the use of confidence measures from N-best decodings to produce speaker-dependent HMMs. Pronunciation modelling improvements involved the use of a large pronunciation dictionary containing multiple pronunciations for many words, pronunciation probabilities, the accommodation of interword silences, and the use of information derived from existing manual annotations to guide the recogniser during decoding. At present, the system can reliably produce time-aligned phonetic annotations for UK accents in which the automatic and manual alignments agree on the segmental labelling 93% of the time. It places boundaries with an r.m.s. error of 14.5 ms relative to the manual boundaries. Subjectively, speech produced using automatic alignments is highly intelligible, though not quite as good as that produced from manual alignments.
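The two figures quoted above, segmental label agreement and r.m.s. boundary error, can be illustrated with a minimal sketch. It assumes the automatic and manual annotations have already been paired segment by segment, which is a simplification and not necessarily how the system was scored; the function names and the toy values are hypothetical.

```python
import math


def label_agreement(auto_labels, manual_labels):
    """Fraction of paired segments whose phonetic labels match.

    Assumption: the two sequences are already aligned one-to-one.
    """
    if len(auto_labels) != len(manual_labels) or not manual_labels:
        raise ValueError("label sequences must be non-empty and equal in length")
    matches = sum(a == m for a, m in zip(auto_labels, manual_labels))
    return matches / len(manual_labels)


def rms_boundary_error(auto_bounds_ms, manual_bounds_ms):
    """Root-mean-square difference between paired boundary times, in ms."""
    if len(auto_bounds_ms) != len(manual_bounds_ms) or not manual_bounds_ms:
        raise ValueError("boundary sequences must be non-empty and equal in length")
    squared = sum((a - m) ** 2 for a, m in zip(auto_bounds_ms, manual_bounds_ms))
    return math.sqrt(squared / len(manual_bounds_ms))


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    auto = ["h", "e", "l", "ou"]
    manual = ["h", "e", "l", "o"]
    print(label_agreement(auto, manual))
    print(rms_boundary_error([10.0, 20.0], [13.0, 16.0]))
```

In practice the two label sequences can differ in length when the aligner selects a different pronunciation, so a real evaluation would need a sequence alignment step before these measures are computed.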