Exploring features from natural language generation for prosody modeling

Prosody modeling is critical to developing a Concept-to-Speech (CTS) system, in which Natural Language Generation (NLG) and speech synthesis are combined to automatically generate natural, coherent speech. In this paper, we empirically verify the usefulness of various natural language features in prosody modeling. Three groups of features are investigated: (1) semantic, syntactic, and surface features produced by SURGE, a general-purpose surface natural language generator for English; (2) deep semantic and discourse features that are available during the domain modeling and content planning phases of generation; and (3) information-based measures statistically derived from text. Our experiments identify which features in this large set are effective in prosody modeling. This work represents an important step towards building a comprehensive prosody model for CTS systems that employ general NLG. The investigation is conducted in the context of MAGIC, a medical application that involves automatic speech and graphics generation.
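
The abstract does not say which information-based measures are used. As a purely illustrative sketch, one common measure of this kind is the information content of a word, computed as the negative log of its relative frequency in a corpus. The function name `information_content` and the toy corpus below are assumptions made for this example and are not taken from the paper.

```python
import math
from collections import Counter


def information_content(tokens):
    """Map each word type to -log2 of its relative frequency in `tokens`.

    Illustrative assumption: the paper's actual measures may differ.
    """
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: -math.log2(count / total) for word, count in counts.items()}


# Hypothetical toy corpus; a real estimate would need a large text collection.
corpus = "the patient is stable the patient received heparin".split()
ic = information_content(corpus)

# Rarer words receive higher scores than frequent ones.
for word in ("the", "heparin"):
    print(word, ic[word])
```

Under this measure, frequent words such as "the" score lower than rare words such as "heparin". Whether such a score actually helps predict prosodic features like pitch accent is the kind of question the experiments described above address.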
