Synthesizing expressive speech from amateur audiobook recordings

Freely available audiobooks are a rich source of expressive speech recordings for speech synthesis. Natural-sounding, expressive synthetic voices have previously been built from audiobooks containing large amounts of highly expressive speech read by professionally trained speakers. Most freely available audiobooks, however, are read by amateur speakers, are shorter, and contain less expressive (less emphatic, less emotional, etc.) speech in both quality and quantity. Synthesizing expressive speech from a typical online audiobook therefore poses many challenges. In this work we address these challenges with a method built from minimally supervised techniques: aligning the text with the recorded speech, selecting groups of expressive speech segments, and building expressive voices for hidden Markov model (HMM) based synthesis using speaker adaptation. Subjective listening tests show that the expressive synthetic speech generated with this method can often produce utterances suited to an emotional message. We deliberately restricted the amount of speech data in our experiments to show that the method is applicable to most typical audiobooks widely available online.
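The segment-selection step described above can be sketched as unsupervised clustering of speech segments by simple prosodic features. The feature set here (mean F0, F0 range, RMS energy) and the feature values are illustrative assumptions, not the paper's actual features or data; a minimal two-cluster k-means with deterministic farthest-point initialization stands in for whatever grouping method is used.

```python
import numpy as np

def cluster_two(X, n_iter=20):
    """Minimal 2-cluster k-means with deterministic farthest-point init."""
    c0 = X[0]
    c1 = X[np.argmax(np.linalg.norm(X - c0, axis=1))]
    centers = np.stack([c0, c1]).astype(float)
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        # assign each segment to its nearest center
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # move each center to the mean of its assigned segments
        for j in range(2):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels

# Hypothetical per-segment prosodic features: [mean F0 (Hz), F0 range (Hz), RMS energy]
segments = np.array([
    [120.0,  20.0, 0.10],   # flat, quiet reading
    [125.0,  25.0, 0.12],
    [118.0,  18.0, 0.09],
    [180.0,  90.0, 0.30],   # higher, more varied pitch and energy
    [175.0, 100.0, 0.28],
    [185.0,  95.0, 0.32],
])
# z-score each feature so no single scale dominates the distance metric
Xn = (segments - segments.mean(axis=0)) / segments.std(axis=0)
labels = cluster_two(Xn)
```

In a real pipeline the resulting groups would then feed speaker adaptation separately, yielding one adapted HMM voice per expressive style.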
