It is often difficult to determine the suitability of a speaker to serve as a model for concatenative text-to-speech synthesis. The perceived quality of a speaker's natural voice is not necessarily predictive of its (even relative) synthetic quality. The selection of female and male speakers on whom to base two synthetic voices for the new AT&T text-to-speech system was made empirically. Brief readings of identical text materials were recorded from pre-selected professional speakers (6 females and 9 males). Small-scale TTS systems were constructed with a minimal diphone inventory, suitable for synthesizing a limited number of test sentences. Synthesized sentences and their naturally spoken references were presented to listeners in a formal listening evaluation. Listeners rated each test sentence independently on intelligibility, naturalness, and pleasantness. A variety of acoustic measurements of the speakers were made in order to determine which acoustic characteristics correlated with subjective synthesis quality. The results have implications both for speaker selection and for improving concatenative synthesis methods.