A bootstrapping approach to automating prosodic annotation for limited-domain synthesis

Most speech synthesis systems use symbolic prosody labels for marking emphasis and phrase structure, but in corpus-based approaches, prosodic annotation of speech is a labor-intensive process that drives up the cost of developing new voices. This paper explores the potential for reducing that cost by using a bootstrapping approach to automatic prosodic annotation, particularly in a limited-domain application. A perceptual experiment shows that, using predominantly automatic prosody labels, we can achieve nearly as high synthesis quality as if all data were hand-labeled.
