Hierarchical stress modeling and generation in Mandarin for expressive Text-to-Speech

Expressive speech synthesis has received increased attention in recent years. Stress (or pitch accent) is the perceptual prominence within words or utterances, and it contributes to the expressivity of speech. This paper summarizes our contribution to Mandarin expressive speech synthesis. A novel hierarchical stress modeling and generation method for Mandarin is proposed and integrated into both an HMM-based speech synthesis system (HTS) and a Fujisaki model-based speech synthesis system to accurately model the undulation of the pitch contour. In HMM-based expressive speech synthesis, stress-related contextual features obtained from the hierarchical model are used, in addition to the traditional prosodic features of HTS, to model the prosodic variation caused by stress. Two prosodic variation models, one rule-based and one based on a Deep Belief Network, are proposed and used in the stress adaptation module of HTS. The other approach uses the Fujisaki model to improve the expressiveness of synthetic speech. The hierarchical stress model is introduced into the phrase and tone command control mechanisms of the model, and the pitch contour is then generated directly by superposing the two levels of Fujisaki model commands. Experimental results showed that the proposed hierarchical stress modeling and generation methods successfully capture the macro- and micro-characteristics of stress. The proposed methodology is applicable to a range of areas, such as conveying attitude and indicating focus in spoken dialog systems. © 2015 Elsevier B.V. All rights reserved.
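The superposition of phrase and tone commands mentioned above can be illustrated with the standard Fujisaki formulation, in which the logarithm of F0 is the sum of a base value, phrase components, and tone components. The following is a minimal sketch, not the system described in this paper: the constants `alpha`, `beta`, and `gamma` are assumed typical values, all command timings and amplitudes are invented for illustration, and `apply_stress` is a hypothetical stand-in for a stress rule that simply scales one tone command.

```python
import math


def phrase_response(t, alpha=3.0):
    """Impulse response of the phrase control mechanism, Gp(t).

    alpha is an assumed typical value, not taken from the paper.
    """
    if t < 0:
        return 0.0
    return alpha * alpha * t * math.exp(-alpha * t)


def tone_response(t, beta=20.0, gamma=0.9):
    """Step response of the tone control mechanism, Ga(t), capped at gamma.

    beta and gamma are assumed typical values, not taken from the paper.
    """
    if t < 0:
        return 0.0
    return min(1.0 - (1.0 + beta * t) * math.exp(-beta * t), gamma)


def log_f0(t, base_hz, phrase_cmds, tone_cmds):
    """ln F0(t) = ln Fb + phrase components + tone components."""
    value = math.log(base_hz)
    for onset, amp in phrase_cmds:
        value += amp * phrase_response(t - onset)
    for onset, offset, amp in tone_cmds:
        value += amp * (tone_response(t - onset) - tone_response(t - offset))
    return value


def f0_contour(times, base_hz, phrase_cmds, tone_cmds):
    """F0 in Hz at each time point (seconds)."""
    return [math.exp(log_f0(t, base_hz, phrase_cmds, tone_cmds)) for t in times]


def apply_stress(tone_cmds, stressed_index, gain):
    """Hypothetical stress rule: scale the amplitude of one tone command."""
    scaled = []
    for i, (onset, offset, amp) in enumerate(tone_cmds):
        if i == stressed_index:
            amp = amp * gain
        scaled.append((onset, offset, amp))
    return scaled


# Illustrative commands only: (onset, amplitude) and (onset, offset, amplitude).
BASE_HZ = 120.0
PHRASE_CMDS = [(0.0, 0.4)]
TONE_CMDS = [(0.2, 0.4, 0.3), (0.5, 0.7, -0.2)]  # tone commands may be negative

times = [i * 0.01 for i in range(101)]
neutral = f0_contour(times, BASE_HZ, PHRASE_CMDS, TONE_CMDS)
stressed = f0_contour(times, BASE_HZ, PHRASE_CMDS, apply_stress(TONE_CMDS, 0, 1.5))
print(f"peak F0 neutral: {max(neutral):.1f} Hz, stressed: {max(stressed):.1f} Hz")
```

In this sketch, stress only raises the amplitude of a single tone command. A hierarchical model as described in the abstract would also act on the phrase commands, which is not attempted here.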
