Breathing and Speech Planning in Spontaneous Speech Synthesis

Breathing and speech planning in spontaneous speech are coordinated processes that often exhibit disfluent patterns. Although synthetic speech is not subject to respiratory needs, integrating breath into synthesis has advantages for naturalness and recall. At the same time, a synthetic voice that reproduces disfluent breathing patterns learned from the data can be problematic. To address this, we first propose training stochastic TTS on a corpus of overlapping breath-group bigrams, so that context is taken into account. Next, we introduce an unsupervised automatic annotation of likely-disfluent breath events, using a product-of-experts model that combines the output of two breath-event predictors, each using complementary information and operating in opposite directions. This annotation makes it possible to create an automatically breathing spontaneous speech synthesiser with a more fluent breathing style. A subjective evaluation on two spoken genres (impromptu and rehearsed) found the proposed system to be preferred over a baseline that treats all breath events the same.
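The two ideas above can be illustrated with a minimal sketch; it is not the authors' implementation. It assumes that breath groups are available as a list of text segments, that each of the two predictors outputs a probability of a breath event at every candidate position, and that an event is flagged as likely disfluent when the combined probability falls below a threshold. The function names, the Bernoulli form of the product of experts, the threshold value, and the flagging rule are all illustrative assumptions.

```python
from typing import List, Sequence, Tuple


def breath_group_bigrams(groups: Sequence[str]) -> List[Tuple[str, str]]:
    """Pair each breath group with its successor.

    Consecutive pairs share one group, so the resulting corpus overlaps.
    """
    return list(zip(groups, groups[1:]))


def product_of_experts(p_a: float, p_b: float) -> float:
    """Combine two Bernoulli experts by multiplying and renormalising.

    Both outcomes (breath, no breath) are multiplied across experts, then
    the product for 'breath' is divided by the sum over both outcomes.
    """
    joint_yes = p_a * p_b
    joint_no = (1.0 - p_a) * (1.0 - p_b)
    total = joint_yes + joint_no
    if total == 0.0:
        raise ValueError("experts assign zero probability to both outcomes")
    return joint_yes / total


def flag_likely_disfluent(
    p_forward: Sequence[float],
    p_backward: Sequence[float],
    threshold: float = 0.5,  # hypothetical value, not taken from the paper
) -> List[bool]:
    """Flag observed breath events that the combined model finds unlikely.

    Assumption: a low combined probability marks the event as likely disfluent.
    """
    return [
        product_of_experts(a, b) < threshold
        for a, b in zip(p_forward, p_backward)
    ]


if __name__ == "__main__":
    groups = ["so I went there", "and um it was closed", "which was annoying"]
    for pair in breath_group_bigrams(groups):
        print(pair)
    print(flag_likely_disfluent([0.9, 0.2], [0.8, 0.3]))
```

In this form, an expert that outputs 0.5 leaves the other expert's estimate unchanged, while two experts that agree reinforce each other, which is the usual motivation for multiplying rather than averaging their outputs.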