Prosodic and Spectral iVectors for Expressive Speech Synthesis

This work presents a study on the suitability of prosodic and acoustic features, with a special focus on i-vectors, for expressive speech analysis and synthesis. For each utterance of two different databases, laboratory-recorded acted emotional speech and an audiobook, several prosodic and acoustic features are extracted. Among them, i-vectors are built not only from MFCCs, but also from F0, power, and syllable durations. Then, unsupervised clustering is performed using different feature combinations. The resulting clusters are evaluated by calculating cluster entropy for the labeled portions of the databases. Additionally, synthetic voices are trained with speaker adaptive training on the clusters obtained from the audiobook. The voices are evaluated in a perceptual test in which participants have to edit an audiobook paragraph using the synthetic voices. The objective results suggest that i-vectors are very useful for the audiobook, where different speakers (book characters) are imitated. On the other hand, for the laboratory recordings, traditional prosodic features outperform i-vectors. A closer analysis of the created clusters also suggests that different speakers use different prosodic and acoustic means to convey emotions. The perceptual results suggest that the proposed i-vector-based feature combinations can be used for audiobook clustering and voice training.
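The abstract states that the unsupervised clusters are evaluated by computing cluster entropy over the labeled portions of the databases. The sketch below shows one common way to compute a size-weighted cluster entropy from cluster assignments and emotion labels (lower is purer); the function and variable names are illustrative and not taken from the paper.

```python
from collections import Counter
import math


def cluster_entropy(cluster_ids, labels):
    """Size-weighted entropy of the label distribution inside each cluster.

    `cluster_ids` and `labels` are parallel sequences, one entry per utterance.
    A perfect clustering (each cluster contains a single label) has entropy 0.
    """
    assert len(cluster_ids) == len(labels)
    n = len(labels)
    clusters = {}
    for c, y in zip(cluster_ids, labels):
        clusters.setdefault(c, []).append(y)

    total = 0.0
    for members in clusters.values():
        size = len(members)
        counts = Counter(members)
        # Entropy of the label distribution within this cluster.
        h = -sum((k / size) * math.log2(k / size) for k in counts.values())
        # Weight each cluster by its share of the utterances.
        total += (size / n) * h
    return total


# Hypothetical usage: three clusters over utterances labeled with emotions.
if __name__ == "__main__":
    clusters = [0, 0, 0, 1, 1, 2, 2, 2]
    emotions = ["angry", "angry", "sad", "sad", "sad", "happy", "happy", "angry"]
    print(f"weighted cluster entropy: {cluster_entropy(clusters, emotions):.3f}")
```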
