Prosody modeling in concept-to-speech generation

With the development of speech recognition and synthesis technology, speech interfaces for practical applications are in high demand. For applications such as spoken dialogue systems, where not only the waveform but also the content of a system's query or response has to be generated automatically, a Concept-to-Speech system is needed. One key module in a Concept-to-Speech system is prosody modeling, which determines how prosody (intonation), the suprasegmental aspect of speech that communicates the structure and meaning of utterances, should be represented and generated automatically. Prosody is directly affected by the meaning and structure of the sentences produced by a natural language generator, and it also has a significant influence on the naturalness and effectiveness of the synthesized speech. The performance of prosody modeling is therefore critical to the success of a Concept-to-Speech system, where natural language generation and speech synthesis are used together to generate the final spoken output. In this thesis, I focus on two aspects of the prosody modeling process. First, I explore novel features that are available during natural language generation, such as the meaning, structure, and context of sentences, and demonstrate how these features are related to prosody, based on empirical evidence derived from annotated speech corpora. Second, I propose a new prosody modeling approach that automatically combines different natural language features for prosody prediction. More specifically, I design an augmented instance-based learning algorithm that makes use of the natural prosody in human speech to produce natural and vivid synthesized speech. A subjective evaluation demonstrates the effectiveness of this approach. I implement the prosody modeling system for a medical application called MAGIC.
