Overcoming the limitations of statistical parametric speech synthesis

When this thesis began, statistical parametric speech synthesis (SPSS) using hidden Markov models (HMMs) was the dominant synthesis paradigm within the research community. SPSS systems are effective at generalising across the linguistic contexts present in training data to account for the inevitable unseen linguistic contexts at synthesis time, making these systems flexible and their performance stable. However, HMM synthesis suffers from a ‘ceiling effect’ in the naturalness achieved, meaning that, despite great progress, the speech output is rarely mistaken for natural speech. The literature contains many hypotheses about the causes of reduced synthesis quality in HMM speech synthesis, and about the improvements they call for. However, until this thesis, these hypothesised causes were rarely tested.
This thesis makes two types of contribution to the field of speech synthesis, each appearing in a separate part of the thesis. Part I introduces a methodology for testing hypothesised causes of limited quality within HMM speech synthesis systems. This investigation aims to identify what causes these systems to fall short of natural speech. Part II uses the findings from Part I to make informed improvements to speech synthesis.
The usual approach taken to improve synthesis systems is to attribute reduced synthesis quality to a hypothesised cause. A new system is then constructed with the aim of removing that cause. However, this is typically done without prior testing to verify the hypothesised cause. As such, even if improvements in synthesis quality are observed, it is not known whether a real underlying issue or only a more minor one has been fixed. In contrast, I perform a wide range of perceptual tests in Part I of the thesis to discover what the real underlying causes of reduced quality in HMM synthesis are and the extent to which each contributes.
Using the knowledge gained in Part I of the thesis, Part II then looks to make improvements to synthesis quality. Two well-motivated improvements to standard HMM synthesis are investigated. The first follows from the finding that averaging across differing linguistic contexts is a major contributing factor to reduced synthesis quality. This averaging is typically performed during decision tree regression in HMM synthesis. Therefore, a system that removes averaging across differing linguistic contexts and instead averages only across matching linguistic contexts (called rich-context synthesis) is investigated. The second of the motivated improvements follows the finding that the parametrisation (i.e., vocoding) of speech, standard practice in SPSS, introduces a noticeable drop in quality before any
