A computational memory and processing model for prosody

This thesis links processing in working memory to prosody in speech, and links different working memory capacities to different prosodic styles. It provides a causal account of prosodic differences and an architecture for reproducing them in synthesized speech. The implemented system mediates text-based information through a model of attention and working memory. The main simulation parameter of the memory model quantifies recall. Changing its value changes what counts as given and new information in a text, and therefore determines the intonation with which the text is uttered. Other aspects of search and storage in the memory model are mapped to the remaining continuous and categorical features of pitch and timing, producing prosody in three different styles: for small recall values, the exaggerated and sing-song melodies of children's speech; for mid-range values, an adult expressive style; for the largest values, the prosody of a speaker who is familiar with the text and who at times sounds bored or irritated. In addition, because the storage procedure is stochastic, the prosody varies from simulation to simulation, even for identical control parameters. As with human speech, no two renditions are alike. Informal feedback indicates that the stylistic differences are recognizable and that the prosody is improved over current offerings. A comparison with natural data shows clear and predictable trends, although not at statistical significance. However, a comparison within the natural data also did not produce statistically significant results. One practical contribution of this work is a text mark-up schema consisting of relational annotations to grammatical structures. Another is the product itself: varied and plausible prosody in synthesized speech. The main theoretical contribution is to show that resource-bound cognitive activity has prosodic correlates, thus providing a rationale for the individual and stylistic differences in melody and rhythm that are ubiquitous in human speech.
(Copies available exclusively from MIT Libraries, Rm. 14-0551, Cambridge, MA 02139-4307. Ph. 617-253-5668; Fax 617-253-1690.)
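
The link between the recall parameter and the given/new distinction can be illustrated with a toy sketch. This is not the thesis's implementation: the function name `label_given_new`, the list-based memory, the random insertion position, and the rule "new means accented" are all assumptions made here for illustration only. A word counts as given if it is found among the `recall` most recent memory slots, and storage is stochastic because each word is stored at a random position.

```python
import random


def label_given_new(words, recall, seed=None):
    """Label each word 'given' or 'new' using a toy bounded memory.

    `recall` is a hypothetical stand-in for the recall parameter: the number
    of memory slots, counted from the most recent end, that can be searched.
    """
    rng = random.Random(seed)
    memory = []  # stored words; the end of the list is the "most recent" end
    labels = []
    for word in words:
        key = word.lower()
        # Search only the slots within reach of the recall parameter.
        window = memory[-recall:] if recall > 0 else []
        labels.append((word, "given" if key in window else "new"))
        # Stochastic storage (an assumption): store at a random position,
        # so the same text and recall value can yield different labels.
        memory.insert(rng.randint(0, len(memory)), key)
    return labels


def accent_pattern(labels):
    """Toy mapping (an assumption): accent new words, deaccent given ones."""
    return [(word, label == "new") for word, label in labels]


if __name__ == "__main__":
    text = "the cat saw the dog and the dog saw the cat".split()
    for recall in (0, 3, 100):
        print(recall, accent_pattern(label_given_new(text, recall)))
```

In this sketch, a small `recall` leaves more words labeled new, and therefore accented, while a large `recall` treats every repeated word as given. Runs without a fixed `seed` can differ for mid-range values, loosely mirroring the variation between renditions described in the abstract.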
