Paragraph-based prosodic cues for speech synthesis applications

Speech synthesis has improved in both expressiveness and voice quality in recent years. However, achieving full expressiveness in long, multi-sentence synthesized discourse remains a challenge, because speech synthesizers do not account for the prosodic differences that have been observed across discourse units such as paragraphs. The current study validates and extends previous work by analyzing the prosody of paragraph units in a large and diverse corpus of TED Talks, using automatically extracted fundamental frequency (F0), intensity and timing features. In addition, a series of classification experiments was performed to identify which features are consistently used to distinguish paragraph breaks. The results show significant differences in prosody related to paragraph position. Moreover, the classification experiments show that boundary features, such as pause duration and differences in F0 and intensity levels across the boundary, are the most consistent cues for marking paragraph boundaries. This suggests that these features should be taken into account when generating spoken discourse in order to improve naturalness and expressiveness.
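
As an illustration of the kind of boundary features named above, the following is a minimal sketch of how pause duration and F0 and intensity differences could be computed between two adjacent sentences. It is not the study's actual feature extraction: the `Sentence` container, the `boundary_features` function, the `edge_frames` parameter and all numeric values are hypothetical, and the F0 and intensity tracks are assumed to have been extracted beforehand and to be non-empty.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List


@dataclass
class Sentence:
    """Hypothetical container for one sentence's timing and prosody tracks."""
    start: float            # sentence onset, in seconds
    end: float              # sentence offset, in seconds
    f0: List[float]         # F0 values in Hz (assumed voiced frames only, non-empty)
    intensity: List[float]  # intensity values in dB (assumed non-empty)


def boundary_features(prev: Sentence, nxt: Sentence,
                      edge_frames: int = 3) -> Dict[str, float]:
    """Compute simple cues at the boundary between two adjacent sentences.

    edge_frames is an assumed window size: the number of frames averaged
    at the end of `prev` and at the start of `nxt`.
    """
    # Silent interval between the two sentences (clamped at zero).
    pause = max(0.0, nxt.start - prev.end)
    # Difference in mean F0 across the boundary (positive = upward reset).
    f0_diff = mean(nxt.f0[:edge_frames]) - mean(prev.f0[-edge_frames:])
    # Difference in mean intensity across the boundary.
    intensity_diff = (mean(nxt.intensity[:edge_frames])
                      - mean(prev.intensity[-edge_frames:]))
    return {
        "pause_duration": pause,
        "f0_diff": f0_diff,
        "intensity_diff": intensity_diff,
    }


# Made-up values for two adjacent sentences, for illustration only.
last_of_paragraph = Sentence(0.0, 2.0,
                             [200.0, 180.0, 160.0, 140.0, 120.0],
                             [70.0, 68.0, 66.0, 64.0, 62.0])
first_of_next = Sentence(2.8, 5.0,
                         [220.0, 210.0, 200.0, 190.0],
                         [72.0, 71.0, 70.0, 69.0])

features = boundary_features(last_of_paragraph, first_of_next, edge_frames=2)
print(features)
```

Feature vectors of this form, one per sentence boundary, could then be passed to a classifier that labels each boundary as a paragraph break or not.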
