Paper information - A computational memory and processing model for prosody

This thesis links processing in working memory to prosody in speech, and links different working memory capacities to different prosodic styles. It provides a causal account of prosodic differences and an architecture for reproducing them in synthesized speech. The implemented system mediates text-based information through a model of attention and working memory. The main simulation parameter of the memory model quantifies recall. Changing its value changes what counts as given and new information in a text, and therefore determines the intonation with which the text is uttered. Other aspects of search and storage in the memory model are mapped to the remaining continuous and categorical features of pitch and timing, producing prosody in three different styles: for small recall values, the exaggerated and sing-song melodies of children's speech; for mid-range values, an adult expressive style; for the largest values, the prosody of a speaker who is familiar with the text and at times sounds bored or irritated. In addition, because the storage procedure is stochastic, the prosody varies from simulation to simulation, even for identical control parameters. As with human speech, no two renditions are alike. Informal feedback indicates that the stylistic differences are recognizable and that the prosody is improved over current offerings. A comparison with natural data shows clear and predictable trends, although they do not reach statistical significance. However, a comparison within the natural data also did not produce statistically significant results. One practical contribution of this work is a text mark-up schema consisting of relational annotations to grammatical structures. Another is the product itself: varied and plausible prosody in synthesized speech. The main theoretical contribution is to show that resource-bound cognitive activity has prosodic correlates, thus providing a rationale for the individual and stylistic differences in melody and rhythm that are ubiquitous in human speech.
(Copies available exclusively from MIT Libraries, Rm. 14-0551, Cambridge, MA 02139-4307. Ph. 617-253-5668; Fax 617-253-1690.)
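
The relationship the abstract describes between the recall parameter, the given/new distinction, and stochastic storage can be illustrated with a toy sketch. This is not the thesis's implementation: the function names, the `store_prob` parameter, the "search the most recent `recall` items" rule, and the use of capitalization to stand for a pitch accent are all assumptions made only for illustration.

```python
import random

def label_given_new(words, recall, store_prob=0.8, seed=None):
    """Toy given/new labelling (hypothetical, not the thesis's model).

    recall:     assumed to be the number of most recently stored
                items that are searched for each incoming word
    store_prob: assumed chance that a word is stored at all,
                standing in for the stochastic storage procedure
    """
    rng = random.Random(seed)
    memory = []   # stored words, oldest first
    labels = []
    for word in words:
        key = word.lower()
        # Only the last `recall` stored items are searched.
        searched = memory[-recall:] if recall > 0 else []
        status = "given" if key in searched else "new"
        labels.append((word, status))
        # Storage is random, so repeated runs can differ.
        if rng.random() < store_prob:
            memory.append(key)
    return labels

def accents(labels):
    """Mark 'new' words as accented (upper case), 'given' words as deaccented."""
    return [w.upper() if status == "new" else w.lower() for w, status in labels]

if __name__ == "__main__":
    sentence = "the cat saw the cat".split()
    for recall in (1, 10):
        print(recall, accents(label_given_new(sentence, recall, store_prob=1.0)))
```

In this sketch a small `recall` treats almost every word as new and accents it, while a large `recall` finds repeated words in memory and deaccents them, which loosely mirrors the abstract's contrast between the exaggerated style and the style of a speaker familiar with the text. With `store_prob` below 1.0 and no fixed `seed`, repeated runs can label the same text differently.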

Janet E. Cahn | Kenneth Haase
