Paper information - Prosodic and Spectral iVectors for Expressive Speech Synthesis

Prosodic and Spectral iVectors for Expressive Speech Synthesis

This work presents a study on the suitability of prosodic and acoustic features, with a special focus on i-vectors, in expressive speech analysis and synthesis. For each utterance of two different databases, a laboratory-recorded corpus of acted emotional speech and an audiobook, several prosodic and acoustic features are extracted. Among them, i-vectors are built not only on the MFCC base, but also on F0, power and syllable durations. Then, unsupervised clustering is performed using different feature combinations. The resulting clusters are evaluated by calculating cluster entropy for labeled portions of the databases. Additionally, synthetic voices are trained, applying speaker adaptive training, from the clusters built from the audiobook. The voices are evaluated in a perceptual test where the participants have to edit an audiobook paragraph using the synthetic voices.

The objective results suggest that i-vectors are very useful for the audiobook, where different speakers (book characters) are imitated. On the other hand, for the laboratory recordings, traditional prosodic features outperform i-vectors. Also, a closer analysis of the created clusters suggests that different speakers use different prosodic and acoustic means to convey emotions. The perceptual results suggest that the proposed i-vector-based feature combinations can be used for audiobook clustering and voice training.
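As an illustration of the cluster entropy evaluation mentioned in the abstract, the following is a minimal sketch of one common definition: the size-weighted average of the label entropy inside each cluster, normalized by the logarithm of the number of distinct labels. The abstract does not state the exact formula used in the paper, so this definition, the function name `cluster_entropy`, and the toy labels are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def cluster_entropy(cluster_ids, labels):
    """Size-weighted, normalized entropy of labels within clusters.

    Assumed definition (not taken from the paper): for each cluster, compute
    the Shannon entropy of its label distribution, divide by log(number of
    distinct labels), and average over clusters weighted by cluster size.
    0.0 means every cluster contains a single label; 1.0 means labels are
    uniformly mixed in every cluster.
    """
    if len(cluster_ids) != len(labels):
        raise ValueError("cluster_ids and labels must have the same length")
    n = len(labels)
    if n == 0:
        raise ValueError("at least one labeled utterance is required")

    num_classes = len(set(labels))
    if num_classes < 2:
        # With a single label there is nothing to confuse.
        return 0.0

    # Group the reference labels by the cluster each utterance was assigned to.
    by_cluster = defaultdict(list)
    for cluster, label in zip(cluster_ids, labels):
        by_cluster[cluster].append(label)

    total = 0.0
    for members in by_cluster.values():
        size = len(members)
        # Shannon entropy of the label distribution in this cluster.
        h = -sum(
            (count / size) * math.log(count / size)
            for count in Counter(members).values()
        )
        total += (size / n) * (h / math.log(num_classes))
    return total


if __name__ == "__main__":
    # Hypothetical toy data: four utterances, two clusters, two emotion labels.
    clusters = [0, 0, 1, 1]
    emotions = ["joy", "joy", "anger", "joy"]
    print(cluster_entropy(clusters, emotions))
```

Under this definition, lower values indicate that the unsupervised clusters agree better with the labeled portion of the data, which is the sense in which feature combinations can be compared.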

Antonio Bonafonte | Igor Jauk
