An Exploration of Local Speaking Rate Variations in Mandarin Read Speech

This paper explores speaking rate (SR) variation in Mandarin read speech. Rather than assuming that each utterance is produced at a constant, global speaking rate, this study estimates a local speaking rate for each prosodic unit in an utterance. The exploration is based on the existing speaking rate-dependent hierarchical prosodic model (SR-HPM). The main idea is to first use the SR-HPM to analyze the prosodic structures of utterances and extract the prosodic units. A local speaking rate is then estimated for each prosodic unit (the prosodic phrase in this study). The local SR estimation compensates for several major influence factors: tone, base syllable type, prosodic structure, and the speaking rate of the higher prosodic units (utterance and breath group/prosodic phrase group, BG/PG). A syntactic-local SR model is constructed and used in the prosody generation of a Mandarin TTS system. Experimental results on a large read-speech corpus produced by a professional female announcer show that prosody generated with local speaking rate variations is more vivid than prosody generated with a constant speaking rate.
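
As a rough illustration of what a per-phrase rate estimate with compensation might look like, the sketch below computes a raw rate (syllables per second) for each prosodic phrase and a relative rate that divides the duration expected from each syllable's tone and base syllable type by the observed duration. This is not the SR-HPM procedure: the function names, the tuple layout `(tone, base_syllable_type, duration_seconds)`, and the use of class-mean durations as the compensation are all assumptions made for this example, and it ignores prosodic structure and the rates of higher-level units.

```python
from collections import defaultdict
from statistics import mean


def raw_local_rates(phrases):
    """Syllables per second for each prosodic phrase (no compensation)."""
    return [len(p) / sum(dur for _, _, dur in p) for p in phrases]


def compensated_local_rates(phrases):
    """Relative rate per phrase: expected duration / observed duration.

    The expected duration of a syllable is the mean duration of its
    (tone, base syllable type) class over the whole input. This is an
    assumed, simplified compensation. Values above 1 mean the phrase
    was spoken faster than its syllable content alone would predict.
    """
    durations_by_class = defaultdict(list)
    for phrase in phrases:
        for tone, syl_type, dur in phrase:
            durations_by_class[(tone, syl_type)].append(dur)
    class_mean = {k: mean(v) for k, v in durations_by_class.items()}

    rates = []
    for phrase in phrases:
        expected = sum(class_mean[(tone, syl_type)] for tone, syl_type, _ in phrase)
        observed = sum(dur for _, _, dur in phrase)
        rates.append(expected / observed)
    return rates


# Hypothetical toy data: two prosodic phrases with the same syllable classes.
phrases = [
    [(1, "CV", 0.2), (4, "CVN", 0.3)],
    [(1, "CV", 0.3), (4, "CVN", 0.3)],
]
raw = raw_local_rates(phrases)
relative = compensated_local_rates(phrases)
```

Because both toy phrases contain the same syllable classes, the difference between their relative rates reflects tempo alone; with real data the compensation would matter when phrases differ in their tone and syllable-type makeup.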
