Paper information - Review of text-to-speech conversion for English.

The automatic conversion of English text to synthetic speech is presently being performed, remarkably well, by a number of laboratory systems and commercial devices. Progress in this area has been made possible by advances in linguistic theory, acoustic-phonetic characterization of English sound patterns, perceptual psychology, mathematical modeling of speech production, structured programming, and computer hardware design. This review traces the early work on the development of speech synthesizers, discovery of minimal acoustic cues for phonetic contrasts, evolution of phonemic rule programs, incorporation of prosodic rules, and formulation of techniques for text analysis. Examples of rules are used liberally to illustrate the state of the art. Many of the examples are taken from Klattalk, a text-to-speech system developed by the author. A number of scientific problems are identified that prevent current systems from achieving the goal of completely human-sounding speech. While the emphasis is on rule programs that drive a formant synthesizer, alternatives such as articulatory synthesis and waveform concatenation are also reviewed. An extensive bibliography has been assembled to show both the breadth of synthesis activity and the wealth of phenomena covered by rules in the best of these programs. A recording of selected examples of the historical development of synthetic speech, enclosed as a 33 1/3-rpm record, is described in the Appendix.

D. H. Klatt

[1] J. Q. Stewart. An Electrical Analogue of the Vocal Organs , 1922, Nature.

[2] R. Potter. Introduction to Technical Discussions of Sound Portrayal , 1946 .

[3] D. Bolinger. Intonation: Levels Versus Configurations , 1951 .

[4] G. E. Peterson,et al. Control Methods Used in a Study of the Vowels , 1951 .

[5] F. S. Cooper,et al. The interconversion of audible and visible patterns as a basis for research in the perception of speech. , 1951, Proceedings of the National Academy of Sciences of the United States of America.

[6] A. Liberman,et al. Some Experiments on the Perception of Synthetic Speech Sounds , 1952 .

[7] I. Hirsh,et al. Development of materials for speech audiometry. , 1952, The Journal of speech and hearing disorders.

[8] A. House,et al. The Influence of Consonant Environment upon the Secondary Acoustical Characteristics of Vowels , 1953 .

[9] K. Stevens,et al. An Electrical Analog of the Vocal Tract , 1953 .

[10] C. Harris. A Study of the Building Blocks in Speech , 1953 .

[11] A. Liberman,et al. Acoustic Loci and Transitional Cues for Consonants , 1954 .

[12] A. Liberman,et al. The role of consonant-vowel transitions in the perception of the stop and nasal consonants. , 1954 .

[13] K. Stevens,et al. Development of a Quantitative Description of Vowel Articulation , 1955 .

[14] A. Malécot. Acoustic clues for nasal consonants; an experimental study involving a tape-splicing technique. , 1956 .

[15] R. Miller. Nature of the Vocal Cord Wave , 1956 .

[16] A. Liberman,et al. Acoustic Cues for the Perception of Initial /w, j, r, l/ in English , 1957 .

[17] L. Lisker. Minimal Cues for Separating /w, r, l, y/ in Intervocalic Position , 1957 .

[18] A. Liberman,et al. Some Cues for the Distinction Between Voiced and Voiceless Stops in Initial Position , 1957 .

[19] Jean‐Pierre A. Radley,et al. Acoustic Properties of Stop Consonants , 1957 .

[20] T. Chiba. The vowel, its nature and structure , 1958 .

[21] G. E. Peterson,et al. Segmentation Techniques in Speech Synthesis , 1958 .

[22] William S.-Y. Wang,et al. Segment Inventory for Speech Synthesis , 1958 .

[23] G. Rosen. Dynamic analog speech synthesizer , 1958 .

[24] D. Fry. Experiments in the Perception of Stress , 1958 .

[25] Ilse Lehiste,et al. An Acoustic-Phonetic Study of Internal Open Juncture , 1959 .

[26] A. Liberman,et al. Minimal Rules for Synthesizing Speech , 1959 .

[27] K. Stevens,et al. Detectability of Small Irregularities in a Broad‐Band Noise Spectrum , 1959 .

[28] G. E. Peterson,et al. Linguistic Considerations in the Study of Speech Intelligibility , 1959 .

[29] G. E. Peterson,et al. Some Basic Considerations in the Analysis of Intonation , 1960 .

[30] Eva Sivertsen,et al. Segment Inventories for Speech Synthesis , 1960 .

[31] E. Uldall. Attitudinal Meanings Conveyed by Intonation Contours , 1960 .

[32] K. Stevens,et al. An acoustical theory of vowel production and some of its implications. , 1961, Journal of speech and hearing research.

[33] John L. Kelly,et al. An Artificial Talker Driven from a Phonetic Input , 1961 .

[34] K. Stevens,et al. On the Properties of Voiceless Fricative Consonants , 1961 .

[35] M. Mathews,et al. Pitch Synchronous Analysis of Voiced Sounds , 1961 .

[36] A. House. On Vowel Duration in English , 1961 .

[37] J. E. Karlin,et al. Iso‐Preference Method for Evaluating Speech Transmission Circuits , 1961 .

[38] O. Fujimura. Analysis of Nasal Consonants , 1962 .

[39] G. Fairbanks,et al. Diphthong formants and their movements. , 1962, Journal of speech and hearing research.

[40] K. Stevens,et al. Perturbation of vowel articulations by consonantal context: an acoustical study. , 1963, Journal of speech and hearing research.

[41] H. Maxey. Terminal‐Analog Synthesis of Voiced Fricatives , 1963 .

[42] B. Lindblom. Spectrographic Study of Vowel Reduction , 1963 .

[43] K. D. Kryter,et al. Articulation-Testing Methods: Consonantal Differentiation with a Closed-Response Set. , 1965, The Journal of the Acoustical Society of America.

[44] I. Lehiste. Acoustical Characteristics of Selected English Consonants , 1965 .

[45] D. Fry. The Dependence of Stress Judgments on Vowel Formant Structure , 1965 .

[46] S. Ohman. Coarticulation in VCV utterances: spectrographic measurements. , 1966, The Journal of the Acoustical Society of America.

[47] J. Hoard. Juncture and Syllable Structure in English , 1966 .

[48] I. Mattingly. Synthesis by Rule of Prosodic Features , 1966 .

[49] L. Lisker,et al. Some Effects of Context On Voice Onset Time in English Stops , 1967, Language and speech.

[50] S. Ohman. Word and sentence intonation, a quantitative model , 1967 .

[51] A M Liberman,et al. Perception of the speech code. , 1967, Psychological review.

[52] M. F. Schwartz. Transitions in American English /s/ as cues to the identity of adjacent stop consonants. , 1967, The Journal of the Acoustical Society of America.

[53] B. Gold,et al. Analysis of digital and analog formant synthesizers , 1968 .

[54] Lawrence R. Rabiner,et al. Speech synthesis by rule: An acoustic domain approach , 1968 .

[55] Noam Chomsky,et al. The Sound Pattern of English , 1968 .

[56] Ilse Lehiste,et al. Readings in Acoustic Phonetics , 1968 .

[57] I. Hirsh. Intonation, Perception, and Language. , 1968 .

[58] J. Hart,et al. On the anatomy of intonation , 1968 .

[59] O. Fujimura. An approximation to voice aperiodicity , 1968 .

[60] J. Flanagan,et al. Self-oscillating source for vocal-tract synthesizers , 1968 .

[61] C. Peck. An acoustic investigation of the intonation of American English , 1969 .

[62] IEEE Recommended Practice for Speech Quality Measurements , 1969, IEEE Transactions on Audio and Electroacoustics.

[63] Patrick Suppes,et al. Institute for Mathematical Studies in the Social Sciences , 1969 .

[64] F. Lee,et al. Reading machine: From text to speech , 1969 .

[65] S. Hiki. Control Rule of the Tongue Movement for Dynamic Analog Speech Synthesis , 1970 .

[66] William A. Woods,et al. Transition Network Grammars for Natural Language Analysis , 1970 .

[67] M. Halliday. Functional diversity in language as seen from a consideration of modality and mood in English , 1970 .

[68] D. Broad,et al. Formant-frequency trajectories in selected CVC-syllable nuclei. , 1970, The Journal of the Acoustical Society of America.

[69] D. Klatt. Synthesis of Stop Consonants in Initial Position , 1970 .

[70] Lawrence R. Rabiner,et al. Computer synthesis of speech by concatenation of formant-coded words , 1971 .

[71] Victoria A. Fromkin,et al. The Non-Anomalous Nature of Anomalous Utterances , 1971 .

[72] James E. Hoard. Aspiration, Tenseness, and Syllabication in English. , 1971 .

[73] T. P. Barnwell,et al. An algorithm for segment durations in a reading machine context , 1971 .

[74] A. Rosenberg. Effect of glottal pulse shape on the quality of natural vowels. , 1969, The Journal of the Acoustical Society of America.

[75] J. Markel. Digital inverse filtering-a new tool for formant trajectory estimation , 1972 .

[76] J. Flanagan,et al. Synthesis of voiced sounds from a two-mass model of the vocal cords , 1972 .

[77] James L. Flanagan,et al. Wiring telephone apparatus from computer-generated speech , 1972 .

[78] P. Ladefoged,et al. Binary Suprasegmental Features and Transformational Word-Accentuation Rules. , 1972 .

[79] D. Bolinger. Accent Is Predictable (If You're a Mind-Reader) , 1972 .

[80] J. Makhoul. Spectral analysis of speech by linear prediction , 1973 .

[81] D. K. Oller,et al. The effect of position in utterance on speech segment duration in English. , 1973, The Journal of the Acoustical Society of America.

[82] J. Hart,et al. Intonation by rule: a perceptual quest , 1973 .

[83] Peter Ladefoged,et al. The Features of the Larynx. , 1973 .

[84] D. Klatt. Letter: Interaction between two factors that influence vowel duration. , 1973, The Journal of the Acoustical Society of America.

[85] J. Holmes,et al. The influence of glottal waveform on the naturalness of speech from a parallel formant synthesizer , 1973 .

[86] N. Umeda,et al. Automatic synthesis from ordinary English text , 1973 .

[87] W. Ainsworth. A system for converting english text into speech , 1973 .

[88] P. Mermelstein. Articulatory model for the study of speech production. , 1973, The Journal of the Acoustical Society of America.

[89] M. Haggard. Abbreviation of Consonants in English Pre- and Post-Vocalic Clusters. , 1973 .

[90] I R Titze,et al. The Human Vocal Cords: A Mathematical Model , 1974, Phonetica: International Journal of Phonetic Science.

[91] Franklin S. Cooper,et al. A plan for the field evaluation of an automated reading system for the blind , 1973 .

[92] C. Coker,et al. Allophonic variation in American English , 1974 .

[93] Richard C. Atkinson. Teaching Children to Read Using a Computer. , 1974 .

[94] Alphonse Chapanis,et al. The Effects of 10 Communication Modes on the Behavior of Teams During Co-Operative Problem-Solving , 1974, Int. J. Man Mach. Stud..

[95] J. Olive,et al. Rule-synthesis of speech by word concatenation: a first step. , 1974, The Journal of the Acoustical Society of America.

[96] S. Maeda. Characterization of fundamental‐frequency contours of speech , 1974 .

[97] D. Klatt. The duration of (s) in English words. , 1974, Journal of speech and hearing research.

[98] D. Klatt. Voice onset time, frication, and aspiration in word-initial consonant clusters. , 1975, Journal of speech and hearing research.

[99] I. Lehiste,et al. Role of duration in disambiguating syntactically ambiguous sentences , 1975 .

[100] V. Zue,et al. The role of phonological rules in speech understanding research , 1975 .

[101] M. Kahn. Arabic Emphatics: The Evidence for Cultural Determinants of Phonetic Sex-Typing , 1975, Phonetica.

[102] N. Umeda. Vowel duration in American English. , 1975, The Journal of the Acoustical Society of America.

[103] J. McCawley. Review of The Sound Pattern of English , 1975 .

[104] C. Coker,et al. The importance of spectral detail in initial-final contrasts of voiced stops , 1975 .

[105] N. Umeda,et al. The parsing program for automatic text-to-speech synthesis developed at the electrotechnical laboratory in 1968 , 1975 .

[106] D. Klatt. Vowel Lengthening is Syntactically Determined in a Connected Discourse. , 1975 .

[107] I. Lehiste. The Phonetic Structure of Paragraphs , 1975 .

[108] C.H. Coker,et al. A model of articulatory dynamics and control , 1976, Proceedings of the IEEE.

[109] J. Olive,et al. Speech resynthesis from phoneme-related parameters. , 1975, The Journal of the Acoustical Society of America.

[110] Sharon Hunnicutt. Phonological Rules for a Text-to-Speech System , 1976, International Conference on Computational Logic.

[111] J. Allen,et al. Synthesis of speech from unrestricted text , 1976, Proceedings of the IEEE.

[112] Rolf Carlson,et al. A text-to-speech system based entirely on rules , 1976, ICASSP.

[113] N. Umeda,et al. Linguistic rules for text-to-speech synthesis , 1976, Proceedings of the IEEE.

[114] D. Klatt. Linguistic uses of segmental duration in English: acoustic and perceptual evidence. , 1976, The Journal of the Acoustical Society of America.

[115] P. Ladefoged,et al. Fundamental problems in phonetics , 1977 .

[116] Douglas D. O'Shaughnessy. Fundamental frequency by rule for a text-to-speech system , 1977 .

[117] L. Nakatani,et al. Locus of segmental cues for word juncture. , 1977, The Journal of the Acoustical Society of America.

[118] R. B. Monsen,et al. Study of variations in the male and female glottal wave. , 1976, The Journal of the Acoustical Society of America.

[119] M. Halle,et al. English stress : its form, its growth, and its role in verse , 1977 .

[120] L. Lisker. Rapid versus rabid: A catalogue of acoustic features that may cue the distinction , 1977 .

[121] L L Elliott,et al. Development of a test of speech intelligibility in noise using sentence materials with controlled word predictability. , 1977, The Journal of the Acoustical Society of America.

[122] J. Olive,et al. Rule synthesis of speech from dyadic units , 1977 .

[123] N. Umeda. Consonant duration in American English , 1977 .

[124] Richard T. Gagnon,et al. Votrax real time hardware for phoneme synthesis of speech , 1978, ICASSP.

[125] L. Nakatani,et al. Hearing "words" without words: prosodic cues for word perception. , 1978, The Journal of the Acoustical Society of America.

[126] William E Cooper,et al. Hierarchical coding in speech timing , 1978, Cognitive Psychology.

[127] M. Liberman. Phonetic transcription, stress, and segment durations from spelled proper names , 1978 .

[128] L. Streeter. Acoustic determinants of phrase boundary perception. , 1978, The Journal of the Acoustical Society of America.

[129] I. Titze,et al. A theoretical study of the effects of various laryngeal configurations on the acoustics of phonation. , 1979, The Journal of the Acoustical Society of America.

[130] V. Zue,et al. Acoustic study of medial /t,d/ in American English , 1979 .

[131] Frances Ingeman. Speech synthesis by rule using the FOVE program , 1979 .

[132] F. Fallside,et al. Speech synthesis from concept: A method for speech output from information systems , 1979 .

[133] M. Liberman,et al. A set of concatenative units for speech synthesis , 1979 .

[134] Douglas D. O'Shaughnessy,et al. Linguistic features in fundamental frequency patterns , 1979 .

[135] David J. Broad,et al. The New Theories of Vocal Fold Vibration , 1979 .

[136] Patrick Suppes,et al. Current Trends in Computer-Assisted Instruction , 1979, Adv. Comput..

[137] Dennis H. Klatt,et al. Software for a cascade/parallel formant synthesizer , 1980 .

[138] David B. Pisoni,et al. Unlimited text-to-speech system: Description and evaluation of a microprocessor based device , 1980, ICASSP.

[139] Catherine P. Browman. Rules for demisyllable synthesis using Lingua, a language interpreter , 1980, ICASSP.

[140] Richard Wiggins. An integrated circuit for speech synthesis , 1980, ICASSP.

[141] Sheri Hunnicutt. Grapheme-to-phoneme rules: A review , 1980 .

[142] B. Lindblom,et al. Modeling the judgment of vowel quality differences. , 1981, The Journal of the Acoustical Society of America.

[143] Noriko Umeda,et al. Boundary perception in fluent speech , 1981 .

[144] N. Umeda,et al. Word duration as an acoustic measure of boundary perception , 1981 .

[145] K Galyas,et al. A multi-language, portable text-to-speech system for the disabled. , 1981, Journal of biomedical engineering.

[146] N. Umeda. Influence of segmental factors on fundamental frequency in fluent speech , 1981 .

[147] Jared Bernstein,et al. Performance Comparison of Component Algorithms for the Phonemicization of Orthography , 1981, ACL.

[148] David W. Shipman,et al. Letter‐to‐phoneme rules: A semi‐automatic discovery procedure , 1982 .

[149] Bishnu S. Atal,et al. A new model of LPC excitation for producing natural-sounding speech at low bit rates , 1982, ICASSP.

[150] D. Kewley-Port. Measurement of formant transitions in naturally produced stop consonant-vowel syllables. , 1982, The Journal of the Acoustical Society of America.

[151] L. Henderson. Orthography and Word Recognition in Reading , 1982 .

[152] J. G. Martin,et al. Perception of anticipatory coarticulation effects in vowel-stop consonant-vowel sequences. , 1982, Journal of experimental psychology. Human perception and performance.

[153] Sheri Hunnicutt,et al. Bliss communication with speech or text output , 1982, ICASSP.

[154] S. Hertz. From text to speech with SRS , 1982 .

[155] Dennis H. Klatt,et al. Prediction of perceived phonetic distance from critical-band spectra: A first step , 1982, ICASSP.

[156] Sheri Hunnicutt,et al. A multi-language text-to-speech module , 1982, ICASSP.

[157] D. Pisoni,et al. Some comparisons of intelligibility of synthetic and natural speech at different speech‐to‐noise ratios , 1982 .

[158] W. Strong,et al. A model for the synthesis of natural sounding vowels , 1983 .

[159] Kenneth Ward Church. Phrase-structure parsing: a method for taking advantage of allophonic constraints , 1983 .

[160] T.C.R.S. Fowler. A reading machine for the blind , 1983 .

[161] D. O'Shaughnessy,et al. Linguistic modality effects on fundamental frequency in speech. , 1983, The Journal of the Acoustical Society of America.

[162] Dik Lun Lee,et al. Voice response systems , 1983, CSUR.

[163] D. Ladd. Phonological Features of Intonational Peaks , 1983 .

[164] T. Feustel,et al. Capacity Demands in Short-Term Memory for Synthetic and Natural Speech , 1983, Human factors.

[165] J. N. Holmes,et al. Formant synthesizers: Cascade or parallel? , 1983, Speech Commun..

[166] Robert T. Lund,et al. University-to-industry advanced technology transfer: A case study , 1983 .

[167] J. Bernstein,et al. Fundamental frequency in sentence production , 1984, Proceedings of the IEEE.

[168] J. N. Holmes,et al. Implementation of a parallel-formant speech synthesiser using a single-chip programmable signal processor , 1984 .

[169] P. W. Nye,et al. Evolution of reading machines for the blind: Haskins Laboratories' research as a case history. , 1984, Journal of rehabilitation research and development.

[170] Rolf Carlson,et al. Swedish Speech Researchers Team Up with Electronic Venture Capitalists , 1984 .

[171] T. Carrell. Contributions of Fundamental Frequency, Formant Spacing, and Glottal Waveform to Talker Identification. Research on Speech Perception. Technical Report No. 5. , 1984 .

[172] X. Rodet. Time-Domain Formant-Wave-Function Synthesis , 1984 .

[173] Hy Murveit,et al. Telephone communication between deaf and hearing persons , 1984, ICASSP.

[174] J. N. Holmes. Speech Technology in the Next Decades , 1984 .

[175] W. M. Rabinowitz,et al. Standardization of a test of speech perception in noise. , 1979, Journal of speech and hearing research.

[176] D. Klatt,et al. Synthesis by rule of Japanese , 1984 .

[177] Mark Liberman,et al. Synthesis by rule of english intonation patterns , 1984, ICASSP '84. IEEE International Conference on Acoustics, Speech, and Signal Processing.

[178] Victor Zue,et al. Properties of consonant sequences within words and across word boundaries , 1984, ICASSP.

[179] J. Olive,et al. Text to speech—An overview , 1985 .

[180] C. Coker. A dictionary‐intensive letter‐to‐sound program , 1985 .

[181] D.B. Pisoni,et al. Perception of synthetic speech generated by rule , 1985, Proceedings of the IEEE.

[182] S.R. Hertz,et al. The delta rule development system for speech synthesis from text , 1985, Proceedings of the IEEE.

[183] George N. Clements,et al. The geometry of phonological features , 1985, Phonology Yearbook.

[184] E C Schwab,et al. Some Effects of Training on the Perception of Synthetic Speech , 1985, Human factors.

[185] C. Browman,et al. Representation of voicing contrasts using articulatory gestures , 1986 .

[186] D. Pisoni,et al. Preference judgments comparing different synthetic voices , 1986 .

[187] Richard K. Olson,et al. Reading instruction and remediation with the aid of computer speech , 1986 .

[188] B. Repp. Perception of the [m]-[n] distinction in CV syllables. , 1986, The Journal of the Acoustical Society of America.

[189] Michael J. Dedina,et al. Comprehension of natural and synthetic speech using a sentence verification task , 1986 .

[190] Kenneth Ward Church. Stress assignment in letter to sound rules for speech synthesis , 1986, ICASSP '86. IEEE International Conference on Acoustics, Speech, and Signal Processing.

[191] C. Wright,et al. Diagnostic evaluation of a synthesizer's acoustic inventory , 1986 .

[192] G. Kopec,et al. Network-based connected digit recognition using explicit acoustic-phonetic modeling , 1986, ICASSP '86. IEEE International Conference on Acoustics, Speech, and Signal Processing.

[193] S. Hunnicutt. Lexical Prediction for A Text-to-Speech System , 1986 .

[194] C. Nixon,et al. The Perception of Synthetic Speech in Noise , 1986 .

[195] D. Klatt. Detailed spectral analysis of a female voice , 1986 .

[196] Sheri Hunnicutt. Bliss Symbol-to-Speech Conversion: "Blisstalk" , 1986 .

[197] Hiroya Fujisaki,et al. Proposal and evaluation of models for the glottal source waveform , 1986, ICASSP '86. IEEE International Conference on Acoustics, Speech, and Signal Processing.

[198] Julia Hirschberg,et al. The intonational Structuring of Discourse , 1986, ACL.

[199] K. Stevens,et al. Some Acoustical and Perceptual Correlates of Nasal Vowels , 1987 .

[200] E. J. Lerner,et al. Realism in synthetic speech , 1987 .

[201] David B. Pisoni,et al. Text-to-speech: the MITalk system , 1987 .

[202] Terrence J. Sejnowski,et al. NETtalk: a parallel network that learns to read aloud , 1988 .