Modeling and Generating Tone Contour with Phrase Intonation for Mandarin Chinese Speech

This paper models F0 contours of Mandarin Chinese speech with discrete cosine transform (DCT) representations at two levels: syllable-level tone and phrase-level intonation. Decision trees, grown with a maximum likelihood (ML) criterion and stopped with a minimum description length (MDL) criterion, cluster the very rich context-dependent DCT models into generalized ones so that contexts unseen in training can be predicted robustly at test time. Additionally, we propose to generate Mandarin tone contours by jointly optimizing the syllable-level and phrase-level F0 contours in the ML sense. Experimental results on a speaker-dependent continuous speech corpus and a speaker-independent isolated speech corpus show that the proposed approach generates F0 contours with high correlation coefficients of 0.92 and 0.82, respectively, measured between the original and generated F0.
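As a rough illustration of the DCT parameterization and the correlation measure mentioned above, the following minimal sketch represents a synthetic F0 contour by a few DCT coefficients and compares the reconstruction with the original. The contour values, the number of retained coefficients, and the function names (`dct`, `idct`, `correlation`) are assumptions made for this example only; the decision-tree clustering and the joint ML generation described in the abstract are not shown.

```python
import math

def dct(x, k_max):
    """First k_max orthonormal DCT-II coefficients of the sequence x."""
    n = len(x)
    coeffs = []
    for k in range(k_max):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        coeffs.append(scale * sum(
            x[t] * math.cos(math.pi * (t + 0.5) * k / n) for t in range(n)))
    return coeffs

def idct(coeffs, n):
    """Reconstruct n samples from (possibly truncated) DCT-II coefficients."""
    out = []
    for t in range(n):
        value = 0.0
        for k, c in enumerate(coeffs):
            scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
            value += scale * c * math.cos(math.pi * (t + 0.5) * k / n)
        out.append(value)
    return out

def correlation(a, b):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / math.sqrt(var_a * var_b)

# Hypothetical log-F0 contour: an arch-shaped tone on a slowly falling baseline.
f0 = [5.0 + 0.3 * math.sin(math.pi * t / 19) - 0.01 * t for t in range(20)]

# Keep only the first 4 coefficients (an assumed order, not the paper's setting).
coeffs = dct(f0, 4)
recon = idct(coeffs, len(f0))
r = correlation(f0, recon)
print("DCT coefficients:", [round(c, 4) for c in coeffs])
print("correlation between original and reconstruction:", round(r, 4))
```

A low-order truncation like this keeps the overall shape of the contour while discarding fine detail, which is one reason a compact DCT vector is a convenient target for context-dependent modeling.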
