Analysing Shortcomings of Statistical Parametric Speech Synthesis

Output from statistical parametric speech synthesis (SPSS) remains noticeably worse than natural speech recordings in terms of quality, naturalness, speaker similarity, and intelligibility in noise. There are many hypotheses regarding the origins of these shortcomings, but they are often vague and presented without empirical evidence that could confirm and quantify how a specific shortcoming contributes to imperfections in the synthesised speech. Throughout the speech synthesis literature, surprisingly little work is dedicated to identifying the perceptually most important problems in speech synthesis, even though such knowledge would be of great value for creating better SPSS systems. In this book chapter, we analyse some of the shortcomings of SPSS. In particular, we discuss issues with vocoding and present a general methodology for quantifying the effect of any of the many assumptions and design choices that hold SPSS back. The methodology is accompanied by an example that carefully measures and compares the severity of the perceptual limitations imposed by vocoding and by other factors such as the statistical model and its use.
