Vowel and speaker identification in natural and synthetic speech.
The purpose of this study was to develop a simple means of evaluating the relative quality of speech synthesizers. A set of 10 monophthongal English vowels was produced by a man, a woman, and a child. These vowels were then synthesized on a Glace-Holmes synthesizer, using values measured from spectrograms of the natural productions. In addition, synthetic stimuli were generated from the averages published by Peterson and Barney [J. Acoust. Soc. Amer. 24, 175–184 (1951)]. In this latter set, the formant values for men, women, and children were each combined with the fundamental frequencies for men, women, and children, resulting in 9 different combinations for each of the 10 vowels. The 150 stimuli were presented, in random order, to 60 trained listeners for both vowel and speaker identification. The overall vowel identification score for the normal set (all three speakers combined) was 79.46%; the overall speaker identification score (all 10 vowels combined) was 90.03%. The corresponding scores for the set synthesized from measured spectrograms were 50.87% and 69.73%...
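The total of 150 stimuli is consistent with the three sets described above, and the arithmetic can be checked with a minimal Python sketch. The breakdown itself (30 natural, 30 synthesized from measured spectrograms, 90 synthesized from published averages) is an inference from the abstract rather than something it states outright, and the vowel labels, set names, and `build_stimuli` function are hypothetical placeholders.

```python
from collections import Counter
from itertools import product

# Placeholder labels: the abstract does not list the 10 vowels by name.
VOWELS = [f"vowel_{i}" for i in range(1, 11)]
SPEAKERS = ["man", "woman", "child"]


def build_stimuli():
    """Enumerate stimuli as (set_name, vowel, formant_source, f0_source)."""
    # Natural productions: each speaker produces each vowel once.
    natural = [("natural", v, s, s) for v, s in product(VOWELS, SPEAKERS)]
    # Synthesized from values measured on each speaker's own spectrograms.
    measured = [("synth_measured", v, s, s) for v, s in product(VOWELS, SPEAKERS)]
    # Synthesized from published averages: every formant set is crossed
    # with every fundamental frequency (3 x 3 = 9 combinations per vowel).
    averaged = [
        ("synth_averaged", v, formants, f0)
        for v, formants, f0 in product(VOWELS, SPEAKERS, SPEAKERS)
    ]
    return natural + measured + averaged


stimuli = build_stimuli()
counts = Counter(set_name for set_name, *_ in stimuli)
print(dict(counts), len(stimuli))
```

Under these assumptions the three sets contain 30, 30, and 90 items, which sum to the 150 stimuli reported.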