Paper information - Pronunciation modeling in speech synthesis

This dissertation investigates the area of pronunciation modeling in speech synthesis. By pronunciation modeling, we mean architectures and principles for generating high-quality human-like pronunciations. The term pronunciation modeling has previously been applied in the context of speech recognition (e.g. Byrne et al. 1997). In that context, it describes theories and procedures for handling the pronunciation variation that naturally occurs across speakers. In contrast, our work is in the domain of text-to-speech synthesis, which, as we will show, requires modeling the pronunciation variation of an individual whose speech the synthesizer is attempting to model. We will explain our methodology for learning and reproducing pronunciation variation on an individual basis, and show how the most crucial features of such variation can be easily generated using the architecture we describe. Throughout the course of this exposition, we highlight contributions to linguistic theory that such a thorough analysis of individual variation provides.

We describe the postlexical module of an English text-to-speech synthesizer. This module is responsible for transforming underlying lexical pronunciations from a lexical database into contextually appropriate surface postlexical pronunciations. This transformation is achieved by machine learning of a corpus of hand-labeled postlexical pronunciations that have been aligned with lexical pronunciations. The machine learning is conducted by a neural network, whose architecture and data encoding we describe. A thorough analysis of the performance of the postlexical module is offered, with attention to the relative success of the neural network at learning a wide range of postlexical phenomena. We examine the extent to which a symbolic approach to allophony is warranted, and provide an acoustic analysis that attempts to answer this question. We offer assessments of the success of existing theories of phonetics, phonology, and their interface, based on the experience of generating a complete postlexical phonology of English for use in synthetic speech.

Comments: University of Pennsylvania Institute for Research in Cognitive Science Technical Report No. IRCS-98-09. This thesis or dissertation is available at ScholarlyCommons: http://repository.upenn.edu/ircs_reports/55
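The abstract states that hand-labeled postlexical pronunciations are aligned with lexical pronunciations before training, but it does not say how the alignment is computed. As a rough illustration only, the sketch below aligns two phone sequences with a plain unit-cost edit distance. The function name `align`, the gap symbol, the costs, and the ARPAbet-like phone labels are all assumptions made for this example and are not taken from the dissertation.

```python
def align(lexical, postlexical, gap="-"):
    """Align two phone sequences using unit-cost edit distance.

    Hypothetical helper: costs of 1 for substitution, insertion, and
    deletion are an assumption, not the dissertation's method.
    """
    n, m = len(lexical), len(postlexical)
    # cost[i][j] = minimal edit cost of aligning lexical[:i] with postlexical[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            mismatch = int(lexical[i - 1] != postlexical[j - 1])
            cost[i][j] = min(
                cost[i - 1][j - 1] + mismatch,  # match or substitution
                cost[i - 1][j] + 1,             # lexical phone deleted
                cost[i][j - 1] + 1,             # postlexical phone inserted
            )
    # Trace back from the bottom-right corner to recover aligned pairs.
    pairs = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            mismatch = int(lexical[i - 1] != postlexical[j - 1])
            if cost[i][j] == cost[i - 1][j - 1] + mismatch:
                pairs.append((lexical[i - 1], postlexical[j - 1]))
                i, j = i - 1, j - 1
                continue
        if i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            pairs.append((lexical[i - 1], gap))
            i -= 1
        else:
            pairs.append((gap, postlexical[j - 1]))
            j -= 1
    pairs.reverse()
    return pairs


# Illustrative phone strings (assumed labels): a lexical /t/ realized as a flap.
flapped = align(["b", "ah", "t", "er"], ["b", "ah", "dx", "er"])
# A lexical final /d/ with no postlexical counterpart.
deleted = align(["ae", "n", "d"], ["ae", "n"])
```

Each aligned pair gives a lexical phone and its surface realization (or a gap), which is the kind of paired data a learner of postlexical variation would need as training examples.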

Corey Miller


[95] M. Picheny,et al. Speaking clearly for the hard of hearing. II: Acoustic characteristics of clear and conversational speech. , 1986, Journal of speech and hearing research.

[96] Maxine Eskénazi. Changing speech styles: strategies in read speech and casual and careful spontaneous speech , 1992, ICSLP.

[97] Chris Golston,et al. DIRECT OPTIMALITY THEORY : REPRESENTATION AS PURE MARKEDNESS , 1996 .

[98] Douglas D. O'Shaughnessy,et al. Modelling fundamental frequency, and its relationship to syntax, semantics, and phonetics , 1976 .

[99] Dani Byrd,et al. Relations of sex and dialect to reduction , 1994, Speech Communication.

[100] M. Halle,et al. Segmental phonology of Modern English , 1985 .

[101] Daniel Gildea,et al. Learning Bias and Phonological-Rule Induction , 1996, CL.

[102] Michael Riley,et al. Some Applications of Tree-based Modelling to Speech and Language , 1989, HLT.

[103] Teuvo Kohonen,et al. Self-Organization and Associative Memory , 1988 .

[104] Michael B. Broe. Specification theory : the treatment of redundancy in generative phonology , 1993 .

[105] Janet B. Pierrehumbert,et al. Synthesizing Allophonic Glottalization , 1997 .

[106] J. Pierrehumbert,et al. Japanese Tone Structure , 1988 .

[107] Jia-Wei Hong. On connectionist models , 1988 .

[108] W. Labov. The social stratification of English in New York City , 1969 .

[109] E C Schwab,et al. Some Effects of Training on the Perception of Synthetic Speech , 1985, Human factors.

[110] David B. Pisoni,et al. Text-to-speech: the mitalk system , 1987 .

[111] Daniel Jones,et al. The pronunciation of English , 1919 .

[112] K. D. Kryter,et al. ARTICULATION-TESTING METHODS: CONSONANTAL DIFFERENTIATION WITH A CLOSED-RESPONSE SET. , 1965, The Journal of the Acoustical Society of America.

[113] Elizabeth C. Zsiga. Phonology and Phonetic Evidence: An acoustic and electropalatographic study of lexical and postlexical palatalization in American English , 1995 .

[114] Diana Archangeli,et al. Aspects of underspecification theory , 1988, Phonology.

[115] Mervyn A. Jack,et al. Phonetic transcription standards for european names (ONOMASTICA) , 1993, EUROSPEECH.

[116] Gregory R. Guy,et al. Inherent variability and the obligatory contour principle , 1997, Language Variation and Change.

[117] Susan M. Mniszewski,et al. A Default Hierarchy for Pronouncing English , 1991, IEEE Trans. Pattern Anal. Mach. Intell..

[118] B. Van Coile. Inductive learning of pronunciation rules with the Depes system , 1991, ICASSP.

[119] B. Dresher,et al. A computational learning model for metrical phonology , 1990, Cognition.

[120] Steven Greenberg,et al. INSIGHTS INTO SPOKEN LANGUAGE GLEANED FROM PHONETIC TRANSCRIPTION OF THE SWITCHBOARD CORPUS , 1996 .

[121] John Nerbonne,et al. Measuring Dialect Distance Phonetically , 1997, SIGMORPHON@EACL.

[122] J. Pierce. An introduction to information theory: symbols, signals & noise , 1980 .

[123] D. J. Myers,et al. Neural Networks for Vision, Speech, and Natural Language , 1992 .

[124] Noam Chomsky,et al. The Sound Pattern of English , 1968 .

[125] Alex Waibel,et al. Prosody and speech recognition , 1988 .

[126] M. Picheny,et al. Speaking clearly for the hard of hearing , 1979 .

[127] Michael Kenstowicz,et al. Phonology In Generative Grammar , 1994 .

[128] Eric Laporte. Rational Transductions for Phonetic Conversion and Phonology , 1997 .

[129] David B. Pisoni,et al. Intelligibility of normal speech I: Global and fine-grained acoustic-phonetic talker characteristics , 1996, Speech Commun..

[130] H. H. Hock. Principles of historical linguistics , 1986 .

[131] Patti J. Price,et al. Combining Linguistic with Statistical Methods in Automatic Speech Understanding , 1994 .

[132] Robert I. Damper,et al. A novel approach to inferring letter-phoneme correspondences , 1991, [Proceedings] ICASSP 91: 1991 International Conference on Acoustics, Speech, and Signal Processing.

[133] Christel Sorin,et al. Some observations on the processing of mute “e” in a French diphone-based speech synthesis system , 1991 .

[134] W. A. Ainsworth,et al. Applications of Multilayer Perceptrons in Text-To-Speech Synthesis Systems , 1992 .

[135] A. Woods,et al. Statistics in Language Studies , 1986 .

[136] D B Pisoni,et al. Comprehension of Synthetic Speech Produced by Rule: Word Monitoring and Sentence-by-Sentence Listening Times , 1991, Human factors.

[137] Heinz J. Giegerich,et al. English Phonology: An Introduction , 1992 .

[138] I. Lee Hetherington. A characterization of the problem of new, out-of-vocabulary words in continuous-speech recognition and understanding , 1995 .

[139] M. A. Randolph. A data-driven method for discovering and predicting allophonic variation , 1990, International Conference on Acoustics, Speech, and Signal Processing.

[140] Tony Vitale,et al. An Algorithm for High Accuracy Name Pronunciation by Parametric Speech Synthesizer , 1991, Comput. Linguistics.

[141] Joseph E. Grimes,et al. Information dependencies in lexical subentries , 1989 .

[142] Colin W. Wightman,et al. The aligner: text to speech alignment using Markov models and a pronunciation dictionary , 1994, SSW.

[143] David P. Gluch,et al. A very high-performance neural network system architecture using grouped weight quantization , 1989 .

[144] Kenneth Ward Church. A Stochastic Parts Program and Noun Phrase Parser for Unrestricted Text , 1988, ANLP.

[145] Stephanie Seneff,et al. Transcription and Alignment of the TIMIT Database , 1996 .

[146] Elisabeth Selkirk,et al. Phonology and syntax , 1984 .

[147] François Yvon. Grapheme-to-Phoneme Conversion using Multiple Unbounded Overlapping Chunks , 1996, ArXiv.

[148] Noel Massey,et al. Text-to-speech conversion with neural networks: a recurrent TDNN approach , 1998, EUROSPEECH.

[149] Mark Bedworth,et al. NETspeak — A re-implementation of NETtalk , 1987 .

[150] Sean A. Fulop,et al. Pronunciation variability in the Switchboard corpus , 1996 .

[151] Anthony Seeger. Guide to Pronunciation of Suya Words , 1981 .

[152] D. V. Bergem. Acoustic and Lexical Vowel Reduction , 1995 .

[153] Briony Williams,et al. A keyvowel approach to the synthesis of regional accents of English , 1997, EUROSPEECH.

[154] Osamu Fujimura,et al. Allophonic variation in English /l/ and its implications for phonetic implementation , 1993 .

[155] D. Bolinger. Two kinds of vowels, two kinds of rhythm , 1981 .

[156] C. Browman,et al. Papers in Laboratory Phonology: Tiers in articulatory phonology, with some implications for casual speech , 1990 .

[157] Daniel Jurafsky,et al. Learning Phonological Rule Probabilities from Speech Corpora with Exploratory Computational Phonology , 1995, ACL.

[158] Michael Riley,et al. A statistical model for generating pronunciation networks , 1991, [Proceedings] ICASSP 91: 1991 International Conference on Acoustics, Speech, and Signal Processing.

[159] W. Labov,et al. Near-mergers and the suspension of phonemic contrast , 1991, Language Variation and Change.

[160] David Yarowsky,et al. Homograph Disambiguation in Text-to-Speech Synthesis , 1997 .

[161] Donald Fucci,et al. Synthetic Speech Comprehension , 1998 .

[162] Robert I. Damper,et al. Stochastic transduction for English text-to-phoneme conversion , 1991, EUROSPEECH.

[163] Michael Hammond,et al. Vowel quantity and syllabification in English , 1997 .

[164] David B. Pisoni,et al. Perception of Synthetic Speech , 1997 .

[165] Gitta P. M. Laan. The contribution of intonation, segmental durations, and spectral features to the perception of a spontaneous and a read speaking style , 1997, Speech Commun..

[166] Susan Fitt. The generation of regional pronunciations of English for speech synthesis , 1997, EUROSPEECH.

[167] Bert Van Coile. Inductive learning of grapheme-to-phoneme rules , 1990, ICSLP.

[168] G. Booij,et al. Yearbook of Morphology , 1988 .

[169] Sheri Hunnicutt,et al. A text-to-speech system for british English, and issues of dialect and style , 1987, ECST.

[170] A. Bell. Language style as audience design , 1984, Language in Society.

[171] Martin Kay,et al. Regular Models of Phonological Rule Systems , 1994, CL.

[172] L. Baum,et al. An inequality and associated maximization technique in statistical estimation of probabilistic functions of a Markov process , 1972 .

[173] Robert I. Damper. Self-learning and connectionist approaches to text-phoneme conversion , 1995 .

[174] Andrew Cohen. Developing a Nonsymbolic Phonetic Notation for Speech Synthesis , 1995, Comput. Linguistics.

[175] Noel Massey,et al. Variation and Synthetic Speech , 1997, ArXiv.

[176] Florien J. van Beinum. Spectro-temporal reduction and expansion in spontaneous speech and read text: the role of focus words , 1990, ICSLP.

[177] George K. Kokkinakis,et al. Efficient Multilingual Phoneme-to-Grapheme Conversion Based on HMM , 1996, CL.

[178] Bruce Tesar,et al. Computing Optimal Forms in Optimality Theory: Basic Syllabification ; CU-CS-763-95 , 2008 .

[179] Jeffrey D. Ullman,et al. Introduction to Automata Theory, Languages and Computation , 1979 .

[180] Markus Walther,et al. OT SIMPLE - a construction-kit approach to Optimality Theory implementation , 1996, ArXiv.

[181] J. P. Egan. Articulation testing methods , 1948, The Laryngoscope.

[182] Joseph Picone,et al. Automated generation of N-best pronunciations of proper nouns , 1996, 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing Conference Proceedings.

[183] Elizabeth Hume,et al. Front Vowels, Coronal Consonants and Their Interaction in Nonlinear Phonology , 1994 .

[184] K. P. Mohanan,et al. The Theory of Lexical Phonology , 1982 .

[185] Charles Kenneth Thomas,et al. An Introduction to the Phonetics of American English , 1947 .

[186] Jennifer M. Rodd,et al. Recurrent Neural-Network Learning of Phonological Regularities in Turkish , 1997, CoNLL.

[187] Astrid McHugh. Listener Preference and Comprehension Tests of Stress Algorithms for a Text-to-Phonetic Speech Synthesis Program. , 1976 .

[188] Howard C. Nusbaum,et al. Pronounce : a program for pronunciation by analogy , 1991 .

[189] Ulrich Ammon,et al. Sociolinguistics: An international handbook of the science of language and society (Project announcement) , 1984, Language in Society.

[190] CAROLE PARADIS,et al. ON CONSTRAINTS AND REPAIR STRATEGIES , 1987 .

[191] Eleonora Blaauw,et al. The contribution of prosodic boundary markers to the perceptual difference between read and spontaneous speech , 1994, Speech Commun..

[192] David R. Miller,et al. Statistical dialect classification based on mean phonetic features , 1996, Proceeding of Fourth International Conference on Spoken Language Processing. ICSLP '96.

[193] Andrej Ljolje,et al. Automatic speech segmentation for concatenative inventory selection , 1994, SSW.

[194] R. Herold. Mechanisms of merger: The implementation and distribution of the low back merger in eastern Pennsylvania , 1990 .