Creating expressive synthetic voices by unsupervised clustering of audiobooks

In this work we design an approach for automatic feature selection and voice creation for expressive speech synthesis. Our approach is guided by two main goals: (1) increasing the flexibility of expressive voice creation and (2) overcoming the limitations of speaking styles in expressive synthesis. We define a novel feature set that combines traditionally used prosodic features with spectral features, and we propose the use of iVectors. With these features we perform unsupervised clustering of an audiobook excerpt and, from the resulting clusters, we create synthetic voices using the speaker adaptive training (SAT) technique. To assess clustering performance, we propose an objective evaluation technique based on perplexity reduction. This objective evaluation indicates that both prosodic and spectral features contribute to separating speaking styles and emotions, with the best results achieved when iVectors are included in the feature set: the perplexity of the expressions and of the audiobook characters is reduced by factors of 14 and 2, respectively. We also designed a novel subjective evaluation method in which participants edit a small excerpt of an audiobook using the synthetic voices created from the clusters. The results suggest that our feature set is effective for the task of expressiveness and character detection.
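The abstract does not fix a toolchain, so the following is a minimal sketch of the per-utterance feature assembly and clustering step. It assumes iVectors are precomputed by an external extractor (e.g. a Kaldi-style front-end) and passed in as vectors; librosa and scikit-learn are stand-ins for the authors' actual tools, and the function names are hypothetical.

```python
# Sketch: combine prosodic and spectral statistics with a precomputed
# iVector per utterance, then cluster utterances with k-means.
import numpy as np
import librosa
from sklearn.cluster import KMeans

def utterance_features(wav_path, ivector):
    """Build one feature vector per utterance (hypothetical helper)."""
    y, sr = librosa.load(wav_path, sr=16000)
    # Prosodic features: F0 and energy statistics over the utterance.
    f0 = librosa.yin(y, fmin=60, fmax=400, sr=sr)
    rms = librosa.feature.rms(y=y)[0]
    prosodic = [np.nanmean(f0), np.nanstd(f0), rms.mean(), rms.std()]
    # Spectral features: mean and std of MFCCs over the utterance.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
    spectral = np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])
    return np.concatenate([prosodic, spectral, ivector])

def cluster_utterances(wav_paths, ivectors, n_clusters=8):
    """Unsupervised clustering of utterances in the combined feature space."""
    X = np.stack([utterance_features(p, iv)
                  for p, iv in zip(wav_paths, ivectors)])
    # Z-score each dimension so prosodic, spectral and iVector
    # components contribute on comparable scales.
    X = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-9)
    return KMeans(n_clusters=n_clusters, n_init=10).fit_predict(X)
```

Each resulting cluster would then serve as the adaptation data for one SAT-trained synthetic voice; k-means is used here only as a generic stand-in for whichever clustering criterion the system actually employs.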

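The perplexity-reduction measure can be made concrete as follows. This is a sketch under one assumption: that "reduction by a factor" means the ratio of the global label perplexity (2 raised to the Shannon entropy of the label distribution) to the cluster-weighted average of within-cluster perplexities; the paper's exact definition is not reproduced here.

```python
# Sketch: objective evaluation of clustering via perplexity reduction,
# given ground-truth labels (expressions or characters) per utterance.
import numpy as np
from collections import Counter

def perplexity(labels):
    """Perplexity 2^H of the empirical distribution of `labels`."""
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    return 2.0 ** (-np.sum(p * np.log2(p)))

def perplexity_reduction(labels, cluster_ids):
    """Global perplexity divided by the size-weighted average of
    within-cluster perplexities (assumed reading of the metric)."""
    labels = np.asarray(labels)
    cluster_ids = np.asarray(cluster_ids)
    n = len(labels)
    within = sum((np.sum(cluster_ids == c) / n)
                 * perplexity(labels[cluster_ids == c])
                 for c in np.unique(cluster_ids))
    return perplexity(labels) / within
```

Under this reading, a reduction factor of 14 for the expression labels would mean that, once an utterance's cluster is known, its expression is on average fourteen times more predictable than under the global label distribution.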