Paper information - A bootstrapping approach to automating prosodic annotation for limited-domain synthesis

Most speech synthesis systems use symbolic prosody labels for marking emphasis and phrase structure, but in corpus-based approaches prosodic annotation of speech is a labor-intensive process that drives up the cost of developing new voices. This paper explores the potential for reducing that cost by using a bootstrapping approach to automatic prosodic annotation, particularly in a limited-domain application. A perceptual experiment shows that, using predominantly automatic prosody labels, we can achieve nearly as high synthesis quality as if all data were hand-labeled.

M. Ostendorf | I. Bulyko
