Statistical modelling of speech segment duration by constrained tree regression

This paper presents a new method for statistical modelling of prosody control in speech synthesis. The proposed method, referred to as Constrained Tree Regression (CTR), can suitably represent the complex effects of prosody control factors using a moderate amount of training data. It is based on recursive splits of the predictor variable space and the partial imposition of linear independence constraints among predictor variables. It incorporates both linear regression and tree regression with categorical predictor variables, which have conventionally been used for prosody control, and extends them to more general models. In addition, a hierarchical error function is presented to take account of the hierarchical structure in prosody control. The new method is applied to the modelling of speech segmental duration. Experimental results show that the proposed regression method yields better duration models than linear and tree regressions with the same number of free parameters. It is also shown that the hierarchical structure of phoneme and syllable durations can be represented efficiently using the hierarchical error function.

Key words: speech segmental duration, statistical modelling, regression
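The idea of combining splits with additive (linear) constraints can be illustrated with a toy sketch. This is not the paper's CTR algorithm: it assumes that the constraint amounts to treating factors as additive within a subset, and the factor names (`stress`, `final`, `vowel`) and duration values are hypothetical. It compares a fully additive model on categorical predictors with a model that first splits on one factor and then fits an additive model inside each subset.

```python
import numpy as np

def one_hot(levels, n_levels):
    """Dummy-code an integer categorical factor, dropping level 0 as the baseline."""
    out = np.zeros((len(levels), n_levels - 1))
    for row, lv in enumerate(levels):
        if lv > 0:
            out[row, lv - 1] = 1.0
    return out

def fit_additive(factors, n_levels, y):
    """Least-squares additive model: intercept plus one effect per non-baseline level."""
    columns = [np.ones((len(y), 1))]
    columns += [one_hot(f, n) for f, n in zip(factors, n_levels)]
    X = np.hstack(columns)
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return X @ coef

def fit_split_then_additive(split_factor, other_factors, n_levels, y):
    """Split the data on one factor, then fit an additive model in each subset."""
    pred = np.empty_like(y)
    for lv in np.unique(split_factor):
        mask = split_factor == lv
        pred[mask] = fit_additive([f[mask] for f in other_factors], n_levels, y[mask])
    return pred

def sse(y, pred):
    """Sum of squared errors."""
    return float(np.sum((y - pred) ** 2))

# Hypothetical binary factors on a full 2 x 2 x 2 grid, one observation per cell.
stress, final, vowel = (
    g.ravel() for g in np.meshgrid([0, 1], [0, 1], [0, 1], indexing="ij")
)

# Hypothetical durations (ms): stress and final interact, vowel is purely additive.
y = 60.0 + 10.0 * stress + 20.0 * final + 40.0 * stress * final + 15.0 * vowel

# Fully additive model: 4 free parameters, cannot capture the interaction.
sse_additive = sse(y, fit_additive([stress, final, vowel], [2, 2, 2], y))

# Split on stress, additive in the remaining factors: 6 free parameters.
sse_split = sse(y, fit_split_then_additive(stress, [final, vowel], [2, 2], y))

print("additive SSE:", sse_additive)
print("split-then-additive SSE:", sse_split)
```

In this toy setting the fully additive model leaves a residual because of the stress-by-final interaction, while the split model fits the data with 6 parameters rather than the 8 that a full tree with one leaf per cell would need. The two models here do not have the same number of free parameters, so this does not reproduce the paper's comparison.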
