Unsupervised Learning for Expressive Speech Synthesis

Speech synthesis today, especially since the rise of neural networks, is almost entirely data driven. The goal of this thesis is to provide methods for automatic, unsupervised learning from data for expressive speech synthesis. Compared with "ordinary" synthesis systems, reliable expressive training data is harder to obtain, despite the huge amount of material available on sources such as the Internet. The main difficulty lies in the highly speaker- and situation-dependent nature of expressiveness, which produces many acoustically substantial variations. The consequences are twofold. First, it is very difficult to define labels that reliably capture expressive speech with all its nuances; the typical set of six basic emotions, or similar schemes, is a simplification with serious consequences when dealing with data outside the lab. Second, even if a label set is defined, it is difficult, beyond the enormous manual effort, to gather sufficient training data for models that respect all these nuances and variations. This thesis therefore studies automatic training methods for expressive speech synthesis that avoid labeling, and develops applications based on these proposals. The focus lies on the acoustic and the semantic domains.

In the acoustic domain, the goal is to find acoustic features suited to representing expressive speech, especially in the multi-speaker setting, as a step towards uncontrolled, real-life data. The perspective shifts away from traditional, mainly prosody-based features towards features obtained with factor analysis, aiming to identify the principal components of expressiveness, namely i-vectors. Results show that a combination of traditional and i-vector-based features outperforms traditional features in unsupervised clustering of expressive speech, and even outperforms large state-of-the-art feature sets in the multi-speaker domain. Once the feature set is defined, it is used to cluster an audiobook without supervision, and a synthetic voice is trained from each cluster. The method is then evaluated in an audiobook-editing application in which users create their own dialogues with the synthetic voices, choosing voices and assigning them to sentences according to the speaking characters and the intended expressiveness. The obtained results validate the proposal.

Involving the semantic domain, this assignment can be achieved automatically, at least in part. Words and sentences are represented numerically in trainable semantic vector spaces, called embeddings, which can be used to predict expressiveness to some extent. This not only permits fully automatic reading of longer text passages that takes the local context into account, but can also serve as a semantic search engine for training data. Both applications are evaluated in a perceptual test that shows the potential of the proposed method.

Finally, following the recent trends in speech synthesis, a deep neural network based expressive speech synthesis system is designed and tested. Emotionally motivated semantic representations of text, sentiment embeddings, trained on the positivity and negativity of movie reviews, are used as an additional input to the system. The neural network thus learns not only from segmental and contextual information but also from the sentiment embeddings, which affect prosody in particular.
The system is evaluated in two perceptual experiments, which show a preference for including the sentiment embeddings as an additional input.
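
To make the acoustic-domain step concrete, the following is a minimal sketch of feature-level fusion of prosodic statistics and i-vectors followed by unsupervised clustering. It assumes per-utterance prosodic statistics and i-vectors have already been extracted elsewhere; the array names, dimensions, and the choice of k-means are illustrative, not the exact setup used in the thesis.

```python
# Minimal sketch: unsupervised clustering of utterances using a combination of
# traditional prosodic statistics and i-vectors. Feature extraction is assumed
# to have been done elsewhere (per-utterance F0/energy/duration statistics and
# i-vectors from a separately trained extractor); names here are illustrative.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

def cluster_expressive_utterances(prosodic_feats, ivectors, n_clusters=10, seed=0):
    """prosodic_feats: (n_utts, d_pros), ivectors: (n_utts, d_ivec)."""
    # Standardize each feature stream separately so neither dominates the distance.
    pros = StandardScaler().fit_transform(prosodic_feats)
    ivec = StandardScaler().fit_transform(ivectors)
    combined = np.hstack([pros, ivec])  # simple feature-level fusion
    labels = KMeans(n_clusters=n_clusters, random_state=seed, n_init=10).fit_predict(combined)
    return labels  # one cluster id per utterance; a voice can then be trained per cluster

if __name__ == "__main__":
    # Example with random placeholder data (100 utterances).
    rng = np.random.default_rng(0)
    labels = cluster_expressive_utterances(rng.normal(size=(100, 20)),
                                            rng.normal(size=(100, 400)),
                                            n_clusters=5)
    print(labels[:10])
```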
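
In the semantic domain, one simple way to realise the embedding-based voice assignment and the semantic search engine is sketched below. It assumes a pre-trained word-embedding lookup (word to vector); averaging word vectors and ranking by cosine similarity is only one possible instantiation of the idea, not necessarily the model used in the thesis.

```python
# Minimal sketch: assign an expressive cluster to a sentence from its semantic
# embedding, and retrieve semantically similar training sentences.
import numpy as np

def sentence_embedding(sentence, word_vectors, dim):
    # Average the embeddings of the words that are in the lookup table.
    vecs = [word_vectors[w] for w in sentence.lower().split() if w in word_vectors]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def cosine(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

def assign_expressive_cluster(sentence, cluster_centroids, word_vectors, dim):
    """cluster_centroids: {cluster_id: mean embedding of that cluster's transcripts}."""
    emb = sentence_embedding(sentence, word_vectors, dim)
    return max(cluster_centroids, key=lambda c: cosine(emb, cluster_centroids[c]))

def semantic_search(query, training_sentences, word_vectors, dim, top_k=5):
    """Return the training sentences most similar to the query (semantic search engine)."""
    q = sentence_embedding(query, word_vectors, dim)
    scored = [(cosine(q, sentence_embedding(s, word_vectors, dim)), s) for s in training_sentences]
    return sorted(scored, reverse=True)[:top_k]
```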
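
The sentiment-aware neural synthesis system can be pictured with the following architectural sketch: a feed-forward acoustic model whose input concatenates conventional linguistic/contextual features with a sentence-level sentiment embedding. Layer sizes, feature dimensions, and the framework (PyTorch) are assumptions for illustration; the thesis system and its training procedure are not reproduced here.

```python
# Minimal sketch of the architecture idea: acoustic features are predicted from
# the concatenation of linguistic/contextual input features and a sentence-level
# sentiment embedding, so the sentiment mainly shapes the predicted prosody.
import torch
import torch.nn as nn

class SentimentAwareAcousticModel(nn.Module):
    def __init__(self, ling_dim=300, sent_dim=50, acoustic_dim=187, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(ling_dim + sent_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, acoustic_dim),  # e.g. spectral + F0 + aperiodicity parameters
        )

    def forward(self, linguistic_feats, sentiment_embedding):
        # The sentence-level sentiment embedding is repeated for every frame.
        x = torch.cat([linguistic_feats, sentiment_embedding], dim=-1)
        return self.net(x)

# Example forward pass with random placeholder tensors (8 frames of one sentence):
model = SentimentAwareAcousticModel()
frames = torch.randn(8, 300)
sent = torch.randn(1, 50).expand(8, -1)  # same sentiment vector for all frames
acoustic = model(frames, sent)
print(acoustic.shape)  # torch.Size([8, 187])
```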
