Tagging prosody and discourse structure in elicited spontaneous speech

The development of a large corpus of spontaneous Japanese speech under the sponsorship of the Ministry of Posts and Telecommunications is a signal event in the illustrious history of speech technology in this country. Japanese laboratories have been at the forefront in the development of key parts of current automatic speech recognition (ASR) and text-to-speech (TTS) technology, e.g., the use of variable-length units in concatenative speech synthesis [40]. Because of such contributions from many laboratories, both in Japan and elsewhere, speech technology today is at a stage where two more complex and difficult challenges can begin to be addressed seriously. First, large vocabulary ASR systems now have good word recognition rates even for continuous speech, so our emphasis can turn to integrating ASR fully with natural language parsing (NLP) technology in order to try to build complete spoken language understanding systems. Second, the basic algorithms for TTS are now good enough that we can begin to integrate them with NLP technology to design complete spoken language generation systems, which try to generate comprehensible dialogues and not just strings of individually intelligible sentences.
