Review of text-to-speech conversion for English.

The automatic conversion of English text to synthetic speech is presently being performed, remarkably well, by a number of laboratory systems and commercial devices. Progress in this area has been made possible by advances in linguistic theory, acoustic-phonetic characterization of English sound patterns, perceptual psychology, mathematical modeling of speech production, structured programming, and computer hardware design. This review traces the early work on the development of speech synthesizers, discovery of minimal acoustic cues for phonetic contrasts, evolution of phonemic rule programs, incorporation of prosodic rules, and formulation of techniques for text analysis. Examples of rules are used liberally to illustrate the state of the art. Many of the examples are taken from Klattalk, a text-to-speech system developed by the author. A number of scientific problems are identified that prevent current systems from achieving the goal of completely human-sounding speech. While the emphasis is on rule programs that drive a formant synthesizer, alternatives such as articulatory synthesis and waveform concatenation are also reviewed. An extensive bibliography has been assembled to show both the breadth of synthesis activity and the wealth of phenomena covered by rules in the best of these programs. A recording of selected examples of the historical development of synthetic speech, enclosed as a 33 1/3-rpm record, is described in the Appendix.

[1]  J. Q. Stewart An Electrical Analogue of the Vocal Organs , 1922, Nature.

[2]  R. Potter Introduction to Technical Discussions of Sound Portrayal , 1946 .

[3]  D. Bolinger Intonation: Levels Versus Configurations , 1951 .

[4]  G. E. Peterson,et al.  Control Methods Used in a Study of the Vowels , 1951 .

[5]  F. S. Cooper,et al.  The interconversion of audible and visible patterns as a basis for research in the perception of speech. , 1951, Proceedings of the National Academy of Sciences of the United States of America.

[6]  A. Liberman,et al.  Some Experiments on the Perception of Synthetic Speech Sounds , 1952 .

[7]  I. Hirsh,et al.  Development of materials for speech audiometry. , 1952, The Journal of speech and hearing disorders.

[8]  A. House,et al.  The Influence of Consonant Environment upon the Secondary Acoustical Characteristics of Vowels , 1953 .

[9]  K. Stevens,et al.  An Electrical Analog of the Vocal Tract , 1953 .

[10]  C. Harris A Study of the Building Blocks in Speech , 1953 .

[11]  A. Liberman,et al.  Acoustic Loci and Transitional Cues for Consonants , 1954 .

[12]  A. Liberman,et al.  The role of consonant-vowel transitions in the perception of the stop and nasal consonants. , 1954 .

[13]  K. Stevens,et al.  Development of a Quantitative Description of Vowel Articulation , 1955 .

[14]  A. Malécot Acoustic clues for nasal consonants; an experimental study involving a tape-splicing technique. , 1956 .

[15]  R. Miller Nature of the Vocal Cord Wave , 1956 .

[16]  A. Liberman,et al.  Acoustic Cues for the Perception of Initial /w, j, r, l/ in English , 1957 .

[17]  L. Lisker Minimal Cues for Separating /w, r, l, y/ in Intervocalic Position , 1957 .

[18]  A. Liberman,et al.  Some Cues for the Distinction Between Voiced and Voiceless Stops in Initial Position , 1957 .

[19]  Jean‐Pierre A. Radley,et al.  Acoustic Properties of Stop Consonants , 1957 .

[20]  T. Chiba The vowel, its nature and structure , 1958 .

[21]  G. E. Peterson,et al.  Segmentation Techniques in Speech Synthesis , 1958 .

[22]  William S.-Y. Wang,et al.  Segment Inventory for Speech Synthesis , 1958 .

[23]  G. Rosen Dynamic analog speech synthesizer , 1958 .

[24]  D. Fry Experiments in the Perception of Stress , 1958 .

[25]  Ilse Lehiste,et al.  An Acoustic-Phonetic Study of Internal Open Juncture , 1959 .

[26]  A. Liberman,et al.  Minimal Rules for Synthesizing Speech , 1959 .

[27]  K. Stevens,et al.  Detectability of Small Irregularities in a Broad‐Band Noise Spectrum , 1959 .

[28]  G. E. Peterson,et al.  Linguistic Considerations in the Study of Speech Intelligibility , 1959 .

[29]  G. E. Peterson,et al.  Some Basic Considerations in the Analysis of Intonation , 1960 .

[30]  Eva Sivertsen,et al.  Segment Inventories for Speech Synthesis , 1960 .

[31]  E. Uldall Attitudinal Meanings Conveyed by Intonation Contours , 1960 .

[32]  K. Stevens,et al.  An acoustical theory of vowel production and some of its implications. , 1961, Journal of speech and hearing research.

[33]  John L. Kelly,et al.  An Artificial Talker Driven from a Phonetic Input , 1961 .

[34]  K. Stevens,et al.  On the Properties of Voiceless Fricative Consonants , 1961 .

[35]  M. Mathews,et al.  Pitch Synchronous Analysis of Voiced Sounds , 1961 .

[36]  A. House On Vowel Duration in English , 1961 .

[37]  J. E. Karlin,et al.  Iso‐Preference Method for Evaluating Speech Transmission Circuits , 1961 .

[38]  O. Fujimura Analysis of Nasal Consonants , 1962 .

[39]  G. Fairbanks,et al.  Diphthong formants and their movements. , 1962, Journal of speech and hearing research.

[40]  K. Stevens,et al.  Perturbation of vowel articulations by consonantal context: an acoustical study. , 1963, Journal of speech and hearing research.

[41]  H. Maxey Terminal‐Analog Synthesis of Voiced Fricatives , 1963 .

[42]  B. Lindblom Spectrographic Study of Vowel Reduction , 1963 .

[43]  K. D. Kryter,et al.  ARTICULATION-TESTING METHODS: CONSONANTAL DIFFERENTIATION WITH A CLOSED-RESPONSE SET. , 1965, The Journal of the Acoustical Society of America.

[44]  I. Lehiste ACOUSTICAL CHARACTERISTICS OF SELECTED ENGLISH CONSONANTS , 1965 .

[45]  D. Fry The Dependence of Stress Judgments on Vowel Formant Structure , 1965 .

[46]  S. Ohman Coarticulation in VCV utterances: spectrographic measurements. , 1966, The Journal of the Acoustical Society of America.

[47]  J. Hoard Juncture and Syllable Structure in English , 1966 .

[48]  I. Mattingly Synthesis by Rule of Prosodic Features , 1966 .

[49]  L. Lisker,et al.  Some Effects of Context On Voice Onset Time in English Stops , 1967, Language and speech.

[50]  S. Ohman Word and sentence intonation, a quantitative model , 1967 .

[51]  A M Liberman,et al.  Perception of the speech code. , 1967, Psychological review.

[52]  M. F. Schwartz Transitions in American English /s/ as cues to the identity of adjacent stop consonants. , 1967, The Journal of the Acoustical Society of America.

[53]  B. Gold,et al.  Analysis of digital and analog formant synthesizers , 1968 .

[54]  Lawrence R. Rabiner,et al.  Speech synthesis by rule: An acoustic domain approach , 1968 .

[55]  Noam Chomsky,et al.  The Sound Pattern of English , 1968 .

[56]  Ilse Lehiste,et al.  Readings in Acoustic Phonetics , 1968 .

[57]  I. Hirsh Intonation, Perception, and Language. , 1968 .

[58]  J. Hart,et al.  On the anatomy of intonation , 1968 .

[59]  O. Fujimura An approximation to voice aperiodicity , 1968 .

[60]  J. Flanagan,et al.  Self-oscillating source for vocal-tract synthesizers , 1968 .

[61]  C. Peck An acoustic investigation of the intonation of American English , 1969 .

[62]  IEEE Recommended Practice for Speech Quality Measurements , 1969, IEEE Transactions on Audio and Electroacoustics.

[63]  Patrick Suppes,et al.  Institute for Mathematical Studies in the Social Sciences , 1969 .

[64]  F. Lee,et al.  Reading machine: From text to speech , 1969 .

[65]  S. Hiki Control Rule of the Tongue Movement for Dynamic Analog Speech Synthesis , 1970 .

[66]  William A. Woods,et al.  Transition Network Grammars for Natural Language Analysis , 1970 .

[67]  M. Halliday Functional diversity in language as seen from a consideration of modality and mood in English , 1970 .

[68]  D. Broad,et al.  Formant-frequency trajectories in selected CVC-syllable nuclei. , 1970, The Journal of the Acoustical Society of America.

[69]  D. Klatt Synthesis of Stop Consonants in Initial Position , 1970 .

[70]  Lawrence R. Rabiner,et al.  Computer synthesis of speech by concatenation of formant-coded words , 1971 .

[71]  Victoria A. Fromkin,et al.  The Non-Anomalous Nature of Anomalous Utterances , 1971 .

[72]  James E. Hoard Aspiration, Tenseness, and Syllabication in English. , 1971 .

[73]  T. P. Barnwell,et al.  An algorithm for segment durations in a reading machine context , 1971 .

[74]  A. Rosenberg Effect of glottal pulse shape on the quality of natural vowels. , 1969, The Journal of the Acoustical Society of America.

[75]  J. Markel Digital inverse filtering-a new tool for formant trajectory estimation , 1972 .

[76]  J. Flanagan,et al.  Synthesis of voiced sounds from a two-mass model of the vocal cords , 1972 .

[77]  James L. Flanagan,et al.  Wiring telephone apparatus from computer-generated speech , 1972 .

[78]  P. Ladefoged,et al.  Binary Suprasegmental Features and Transformational Word-Accentuation Rules. , 1972 .

[79]  D. Bolinger Accent Is Predictable (If You're a Mind-Reader) , 1972 .

[80]  J. Makhoul Spectral analysis of speech by linear prediction , 1973 .

[81]  D. K. Oller,et al.  The effect of position in utterance on speech segment duration in English. , 1973, The Journal of the Acoustical Society of America.

[82]  J. Hart,et al.  Intonation by rule: a perceptual quest , 1973 .

[83]  Peter Ladefoged,et al.  The Features of the Larynx. , 1973 .

[84]  D. Klatt Letter: Interaction between two factors that influence vowel duration. , 1973, The Journal of the Acoustical Society of America.

[85]  J. Holmes,et al.  The influence of glottal waveform on the naturalness of speech from a parallel formant synthesizer , 1973 .

[86]  N. Umeda,et al.  Automatic synthesis from ordinary English text , 1973 .

[87]  W. Ainsworth A system for converting English text into speech , 1973 .

[88]  P. Mermelstein Articulatory model for the study of speech production. , 1973, The Journal of the Acoustical Society of America.

[89]  M. Haggard Abbreviation of Consonants in English Pre- and Post-Vocalic Clusters. , 1973 .

[90]  I R Titze,et al.  The Human Vocal Cords: A Mathematical Model , 1974, Phonetica: International Journal of Phonetic Science.

[91]  Franklin S. Cooper,et al.  A plan for the field evaluation of an automated reading system for the blind , 1973 .

[92]  C. Coker,et al.  Allophonic variation in American English , 1974 .

[93]  Richard C. Atkinson Teaching Children to Read Using a Computer. , 1974 .

[94]  Alphonse Chapanis,et al.  The Effects of 10 Communication Modes on the Behavior of Teams During Co-Operative Problem-Solving , 1974, Int. J. Man Mach. Stud..

[95]  J. Olive,et al.  Rule-synthesis of speech by word concatenation: a first step. , 1974, The Journal of the Acoustical Society of America.

[96]  S. Maeda Characterization of fundamental‐frequency contours of speech , 1974 .

[97]  D. Klatt The duration of [s] in English words. , 1974, Journal of speech and hearing research.

[98]  D. Klatt Voice onset time, frication, and aspiration in word-initial consonant clusters. , 1975, Journal of speech and hearing research.

[99]  I. Lehiste,et al.  Role of duration in disambiguating syntactically ambiguous sentences , 1975 .

[100]  V. Zue,et al.  The role of phonological rules in speech understanding research , 1975 .

[101]  M. Kahn Arabic Emphatics: The Evidence for Cultural Determinants of Phonetic Sex-Typing , 1975, Phonetica.

[102]  N. Umeda Vowel duration in American English. , 1975, The Journal of the Acoustical Society of America.

[103]  J. McCawley Review of The Sound Pattern of English , 1975 .

[104]  C. Coker,et al.  The importance of spectral detail in initial-final contrasts of voiced stops , 1975 .

[105]  N. Umeda,et al.  The parsing program for automatic text-to-speech synthesis developed at the electrotechnical laboratory in 1968 , 1975 .

[106]  D. Klatt Vowel Lengthening is Syntactically Determined in a Connected Discourse. , 1975 .

[107]  I. Lehiste The Phonetic Structure of Paragraphs , 1975 .

[108]  C.H. Coker,et al.  A model of articulatory dynamics and control , 1976, Proceedings of the IEEE.

[109]  J. Olive,et al.  Speech resynthesis from phoneme-related parameters. , 1975, The Journal of the Acoustical Society of America.

[110]  Sharon Hunnicutt Phonological Rules for a Text-to-Speech System , 1976, International Conference on Computational Logic.

[111]  J. Allen,et al.  Synthesis of speech from unrestricted text , 1976, Proceedings of the IEEE.

[112]  Rolf Carlson,et al.  A text-to-speech system based entirely on rules , 1976, ICASSP.

[113]  N. Umeda,et al.  Linguistic rules for text-to-speech synthesis , 1976, Proceedings of the IEEE.

[114]  D. Klatt Linguistic uses of segmental duration in English: acoustic and perceptual evidence. , 1976, The Journal of the Acoustical Society of America.

[115]  P. Ladefoged,et al.  Fundamental problems in phonetics , 1977 .

[116]  Douglas D. O'Shaughnessy Fundamental frequency by rule for a text-to-speech system , 1977 .

[117]  L. Nakatani,et al.  Locus of segmental cues for word juncture. , 1977, The Journal of the Acoustical Society of America.

[118]  R. B. Monsen,et al.  Study of variations in the male and female glottal wave. , 1976, The Journal of the Acoustical Society of America.

[119]  M. Halle,et al.  English stress : its form, its growth, and its role in verse , 1977 .

[120]  L. Lisker Rapid versus rabid: A catalogue of acoustic features that may cue the distinction , 1977 .

[121]  L L Elliott,et al.  Development of a test of speech intelligibility in noise using sentence materials with controlled word predictability. , 1977, The Journal of the Acoustical Society of America.

[122]  J. Olive,et al.  Rule synthesis of speech from dyadic units , 1977 .

[123]  N. Umeda Consonant duration in American English , 1977 .

[124]  Richard T. Gagnon,et al.  Votrax real time hardware for phoneme synthesis of speech , 1978, ICASSP.

[125]  L. Nakatani,et al.  Hearing "words" without words: prosodic cues for word perception. , 1978, The Journal of the Acoustical Society of America.

[126]  William E Cooper,et al.  Hierarchical coding in speech timing , 1978, Cognitive Psychology.

[127]  M. Liberman Phonetic transcription, stress, and segment durations from spelled proper names , 1978 .

[128]  L. Streeter Acoustic determinants of phrase boundary perception. , 1978, The Journal of the Acoustical Society of America.

[129]  I. Titze,et al.  A theoretical study of the effects of various laryngeal configurations on the acoustics of phonation. , 1979, The Journal of the Acoustical Society of America.

[130]  V. Zue,et al.  Acoustic study of medial /t,d/ in American English , 1979 .

[131]  Frances Ingemann Speech synthesis by rule using the FOVE program , 1979 .

[132]  F. Fallside,et al.  Speech synthesis from concept: A method for speech output from information systems , 1979 .

[133]  M. Liberman,et al.  A set of concatenative units for speech synthesis , 1979 .

[134]  Douglas D. O'Shaughnessy,et al.  Linguistic features in fundamental frequency patterns , 1979 .

[135]  David J. Broad,et al.  The New Theories of Vocal Fold Vibration , 1979 .

[136]  Patrick Suppes,et al.  Current Trends in Computer-Assisted Instruction , 1979, Adv. Comput..

[137]  Dennis H. Klatt,et al.  Software for a cascade/parallel formant synthesizer , 1980 .

[138]  David B. Pisoni,et al.  Unlimited text-to-speech system: Description and evaluation of a microprocessor based device , 1980, ICASSP.

[139]  Catherine P. Browman Rules for demisyllable synthesis using Lingua, a language interpreter , 1980, ICASSP.

[140]  Richard Wiggins An integrated circuit for speech synthesis , 1980, ICASSP.

[141]  Sheri Hunnicutt Grapheme-to-phoneme rules: A review , 1980 .

[142]  B. Lindblom,et al.  Modeling the judgment of vowel quality differences. , 1981, The Journal of the Acoustical Society of America.

[143]  Noriko Umeda,et al.  Boundary perception in fluent speech , 1981 .

[144]  N. Umeda,et al.  Word duration as an acoustic measure of boundary perception , 1981 .

[145]  K Galyas,et al.  A multi-language, portable text-to-speech system for the disabled. , 1981, Journal of biomedical engineering.

[146]  N. Umeda Influence of segmental factors on fundamental frequency in fluent speech , 1981 .

[147]  Jared Bernstein,et al.  Performance Comparison of Component Algorithms for the Phonemicization of Orthography , 1981, ACL.

[148]  David W. Shipman,et al.  Letter‐to‐phoneme rules: A semi‐automatic discovery procedure , 1982 .

[149]  Bishnu S. Atal,et al.  A new model of LPC excitation for producing natural-sounding speech at low bit rates , 1982, ICASSP.

[150]  D. Kewley-Port Measurement of formant transitions in naturally produced stop consonant-vowel syllables. , 1982, The Journal of the Acoustical Society of America.

[151]  L. Henderson Orthography and Word Recognition in Reading , 1982 .

[152]  J. G. Martin,et al.  Perception of anticipatory coarticulation effects in vowel-stop consonant-vowel sequences. , 1982, Journal of experimental psychology. Human perception and performance.

[153]  Sheri Hunnicutt,et al.  Bliss communication with speech or text output , 1982, ICASSP.

[154]  S. Hertz From text to speech with SRS , 1982 .

[155]  Dennis H. Klatt,et al.  Prediction of perceived phonetic distance from critical-band spectra: A first step , 1982, ICASSP.

[156]  Sheri Hunnicutt,et al.  A multi-language text-to-speech module , 1982, ICASSP.

[157]  D. Pisoni,et al.  Some comparisons of intelligibility of synthetic and natural speech at different speech‐to‐noise ratios , 1982 .

[158]  W. Strong,et al.  A model for the synthesis of natural sounding vowels , 1983 .

[159]  Kenneth Ward Church Phrase-structure parsing: a method for taking advantage of allophonic constraints , 1983 .

[160]  T.C.R.S. Fowler A reading machine for the blind , 1983 .

[161]  D. O'Shaughnessy,et al.  Linguistic modality effects on fundamental frequency in speech. , 1983, The Journal of the Acoustical Society of America.

[162]  Dik Lun Lee,et al.  Voice response systems , 1983, CSUR.

[163]  D. Ladd Phonological Features of Intonational Peaks , 1983 .

[164]  T. Feustel,et al.  Capacity Demands in Short-Term Memory for Synthetic and Natural Speech , 1983, Human factors.

[165]  J. N. Holmes,et al.  Formant synthesizers: Cascade or parallel? , 1983, Speech Commun..

[166]  Robert T. Lund,et al.  University-to-industry advanced technology transfer: A case study , 1983 .

[167]  J. Bernstein,et al.  Fundamental frequency in sentence production , 1984, Proceedings of the IEEE.

[168]  J. N. Holmes,et al.  Implementation of a parallel-formant speech synthesiser using a single-chip programmable signal processor , 1984 .

[169]  P. W. Nye,et al.  Evolution of reading machines for the blind: Haskins Laboratories' research as a case history. , 1984, Journal of rehabilitation research and development.

[170]  Rolf Carlson,et al.  Swedish Speech Researchers Team Up with Electronic Venture Capitalists , 1984 .

[171]  T. Carrell Contributions of Fundamental Frequency, Formant Spacing, and Glottal Waveform to Talker Identification. Research on Speech Perception. Technical Report No. 5. , 1984 .

[172]  X. Rodet Time-Domain Formant-Wave-Function Synthesis , 1984 .

[173]  Hy Murveit,et al.  Telephone communication between deaf and hearing persons , 1984, ICASSP.

[174]  J. N. Holmes Speech Technology in the Next Decades , 1984 .

[175]  W. M. Rabinowitz,et al.  Standardization of a test of speech perception in noise. , 1979, Journal of speech and hearing research.

[176]  D. Klatt,et al.  Synthesis by rule of Japanese , 1984 .

[177]  Mark Liberman,et al.  Synthesis by rule of English intonation patterns , 1984, ICASSP '84. IEEE International Conference on Acoustics, Speech, and Signal Processing.

[178]  Victor Zue,et al.  Properties of consonant sequences within words and across word boundaries , 1984, ICASSP.

[179]  J. Olive,et al.  Text to speech—An overview , 1985 .

[180]  C. Coker A dictionary‐intensive letter‐to‐sound program , 1985 .

[181]  D.B. Pisoni,et al.  Perception of synthetic speech generated by rule , 1985, Proceedings of the IEEE.

[182]  S.R. Hertz,et al.  The delta rule development system for speech synthesis from text , 1985, Proceedings of the IEEE.

[183]  George N. Clements,et al.  The geometry of phonological features , 1985, Phonology Yearbook.

[184]  E C Schwab,et al.  Some Effects of Training on the Perception of Synthetic Speech , 1985, Human factors.

[185]  C. Browman,et al.  Representation of voicing contrasts using articulatory gestures , 1986 .

[186]  D. Pisoni,et al.  Preference judgments comparing different synthetic voices , 1986 .

[187]  Richard K. Olson,et al.  Reading instruction and remediation with the aid of computer speech , 1986 .

[188]  B. Repp Perception of the [m]-[n] distinction in CV syllables. , 1986, The Journal of the Acoustical Society of America.

[189]  Michael J. Dedina,et al.  Comprehension of natural and synthetic speech using a sentence verification task , 1986 .

[190]  Kenneth Ward Church Stress assignment in letter to sound rules for speech synthesis , 1986, ICASSP '86. IEEE International Conference on Acoustics, Speech, and Signal Processing.

[191]  C. Wright,et al.  Diagnostic evaluation of a synthesizer's acoustic inventory , 1986 .

[192]  G. Kopec,et al.  Network-based connected digit recognition using explicit acoustic-phonetic modeling , 1986, ICASSP '86. IEEE International Conference on Acoustics, Speech, and Signal Processing.

[193]  S. Hunnicutt Lexical Prediction for A Text-to-Speech System , 1986 .

[194]  C. Nixon,et al.  The Perception of Synthetic Speech in Noise , 1986 .

[195]  D. Klatt Detailed spectral analysis of a female voice , 1986 .

[196]  Sheri Hunnicutt Bliss Symbol-to-Speech Conversion: "Blisstalk" , 1986 .

[197]  Hiroya Fujisaki,et al.  Proposal and evaluation of models for the glottal source waveform , 1986, ICASSP '86. IEEE International Conference on Acoustics, Speech, and Signal Processing.

[198]  Julia Hirschberg,et al.  The intonational Structuring of Discourse , 1986, ACL.

[199]  K. Stevens,et al.  Some Acoustical and Perceptual Correlates of Nasal Vowels , 1987 .

[200]  E. J. Lerner,et al.  Realism in synthetic speech , 1987 .

[201]  David B. Pisoni,et al.  Text-to-speech: the MITalk system , 1987 .

[202]  Terrence J. Sejnowski,et al.  NETtalk: a parallel network that learns to read aloud , 1988 .