Automatic ToBI prediction and alignment to speed manual labeling of prosody

Tagging of corpora for useful linguistic categories can be a time-consuming process, especially with linguistic categories for which annotation standards are relatively new, such as discourse segment boundaries or the intonational events marked in the Tones and Break Indices (ToBI) system for American English. A ToBI prosodic labeling of speech typically takes even experienced labelers from 100 to 200 times real time. An experiment was conducted to determine (1) whether manual correction of automatically assigned ToBI labels would speed labeling, and (2) whether default labels introduced any bias in label assignment. A large speech corpus of one female speaker reading several types of texts was automatically assigned default labels. Default accent placement and phrase boundary location were predicted from text using machine learning techniques. The most common ToBI labels were assigned to these locations for default tones and break type. Predicted pitch accents were automatically aligned to the mid-point of the word, while breaks and edge tones were aligned to the end of the phrase-final word. The corpus was then labeled by a group of five trained transcribers working over a period of nine months. Half of each set of recordings was labeled in the standard fashion without default labels, and the other half was presented with preassigned default labels for labelers to correct. Results indicate that labeling from defaults was generally faster than standard labeling, and that defaults had relatively little impact on label assignment.
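The default-label alignment described above can be illustrated with a minimal sketch. It assumes that word boundaries and the text-based accent and phrase-boundary predictions are already available. The `Word` class, the `default_labels` function, the tier names, and the specific default labels (`H*`, `L-L%`, break index `4`) are hypothetical choices made for illustration; the abstract states only that the most common ToBI labels were used, not which ones.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Assumed defaults: the abstract says only "the most common ToBI labels".
DEFAULT_ACCENT = "H*"       # assumed default pitch accent
DEFAULT_EDGE_TONE = "L-L%"  # assumed default phrase accent + boundary tone
DEFAULT_BREAK = "4"         # assumed default break index at a phrase boundary


@dataclass
class Word:
    text: str
    start: float                # word onset, in seconds
    end: float                  # word offset, in seconds
    accented: bool = False      # accent predicted from text
    phrase_final: bool = False  # phrase boundary predicted after this word


def default_labels(words: List[Word]) -> Dict[str, List[Tuple[float, str]]]:
    """Return a mapping from tier name to (time, label) pairs."""
    tiers: Dict[str, List[Tuple[float, str]]] = {"tones": [], "breaks": []}
    for w in words:
        if w.accented:
            # Pitch accents are aligned to the mid-point of the word.
            tiers["tones"].append(((w.start + w.end) / 2.0, DEFAULT_ACCENT))
        if w.phrase_final:
            # Edge tones and breaks are aligned to the end of the
            # phrase-final word.
            tiers["tones"].append((w.end, DEFAULT_EDGE_TONE))
            tiers["breaks"].append((w.end, DEFAULT_BREAK))
    return tiers
```

A labeler working from defaults would then move, relabel, or delete these entries rather than create each one from scratch.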
