Pitch contour model for Chinese text-to-speech using CART and a statistical model

This paper describes an approach to generating prosody parameters for a Mandarin Chinese text-to-speech system. The Chinese fundamental frequency (F0) contour is decomposed into two parts: a global intonation contour and a syllable-level tone contour. In the corpus, the global intonation contour is converted into pitch target labels. It is predicted in two steps: the pitch target labels are first predicted using a statistical model and a classification tree, and the predicted labels are then converted into actual pitch values. The local syllable-level tone contours are grouped into a fixed number of contour types by clustering, and the contour type of each syllable is predicted with a classification tree. Experiments show that this approach achieves accurate prediction.
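The decomposition and clustering steps can be illustrated with a small sketch. The abstract does not say how contours are sampled or which clustering algorithm is used, so the following is only an illustration under stated assumptions. Each syllable's F0 contour is assumed to be sampled at five equally spaced points. The global component is approximated by the syllable's mean F0, which is a simplification, since the paper represents it with pitch target labels. The residual shapes are clustered with plain k-means. The names `decompose`, `kmeans`, and `reconstruct` are hypothetical.

```python
from typing import List, Sequence, Tuple

Contour = List[float]


def decompose(contour: Sequence[float]) -> Tuple[float, Contour]:
    """Split a syllable's sampled F0 contour (Hz) into a global level and a local shape.

    Assumption: the global level is simply the syllable mean; the local shape
    is the residual around it.
    """
    level = sum(contour) / len(contour)
    return level, [v - level for v in contour]


def _dist2(a: Sequence[float], b: Sequence[float]) -> float:
    # Squared Euclidean distance between two equal-length shapes.
    return sum((x - y) ** 2 for x, y in zip(a, b))


def _mean(points: List[Contour]) -> Contour:
    # Element-wise mean of a non-empty list of shapes.
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def kmeans(shapes: List[Contour], k: int, iters: int = 20) -> Tuple[List[Contour], List[int]]:
    """Plain k-means (an assumed stand-in for the paper's clustering step).

    Uses a deterministic initialisation: the first k shapes become centroids.
    Returns the contour-type centroids and each syllable's type index.
    """
    centroids = [list(s) for s in shapes[:k]]
    labels = [0] * len(shapes)
    for _ in range(iters):
        # Assign each shape to its nearest centroid.
        labels = [min(range(k), key=lambda c: _dist2(s, centroids[c])) for s in shapes]
        # Recompute centroids; keep the old one if a cluster became empty.
        new = []
        for c in range(k):
            members = [s for s, lab in zip(shapes, labels) if lab == c]
            new.append(_mean(members) if members else centroids[c])
        if new == centroids:  # converged
            break
        centroids = new
    return centroids, labels


def reconstruct(level: float, centroid: Sequence[float]) -> Contour:
    # Rebuild an F0 contour from a predicted global level and a contour type.
    return [level + v for v in centroid]


if __name__ == "__main__":
    syllables = [
        [200, 210, 220, 230, 240],  # rising, higher register
        [240, 230, 220, 210, 200],  # falling, higher register
        [150, 160, 170, 180, 190],  # rising, lower register
        [190, 180, 170, 160, 150],  # falling, lower register
    ]
    parts = [decompose(c) for c in syllables]
    centroids, labels = kmeans([shape for _, shape in parts], k=2)
    print(labels)  # labels: [0, 1, 0, 1]
    print(reconstruct(parts[2][0], centroids[labels[2]]))
```

Removing the global level first lets syllables with the same tone shape but different registers fall into the same contour type. In the approach described above, that type index would then be the target class that the classification tree predicts for each syllable.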