Perception of synthetic speech generated by rule

As voice response systems employing synthetic speech become more widespread in consumer products, industrial and military applications, and aids for the handicapped, it will be necessary to develop reliable methods for comparing different synthesis systems and for assessing how human observers perceive and respond to the speech these systems generate. The selection of a specific voice response system for a particular application depends on a wide variety of factors, only one of which is the inherent intelligibility of the speech generated by the synthesis routines. In this paper, we describe the results of several studies that applied measures of phoneme intelligibility, word recognition, and comprehension to assess the perception of synthetic speech. Several techniques were used to compare the performance of different synthesis systems with natural speech and to learn more about how humans perceive synthetic speech generated by rule. Our findings suggest that the perception of synthetic speech depends on an interaction of several factors, including the acoustic-phonetic properties of the speech signal, the requirements of the perceptual task, and the previous experience of the listener. Differences in perception between natural speech and high-quality synthetic speech appear to be related to the redundancy of the acoustic-phonetic information encoded in the speech signal.
