A Comparison Between STRAIGHT, Glottal, and Sinusoidal Vocoding in Statistical Parametric Speech Synthesis

A vocoder expresses a speech waveform as a controllable parametric representation that can be converted back into a speech waveform. In this study, vocoders representing the three main vocoder categories (mixed excitation, glottal, and sinusoidal) were compared with formal and crowd-sourced listening tests. Vocoder quality was measured both in analysis–synthesis and in text-to-speech (TTS) synthesis within a modern statistical parametric speech synthesis framework. Furthermore, the TTS experiments were divided into synthesis with vocoder-specific features and synthesis with a shared envelope model, in which the waveform generation method of each vocoder is mainly responsible for the quality differences. All of the tests included four distinct voices in order to investigate the effect of the speaker on synthesized speech quality. The obtained results suggest that the choice of voice has a profound impact on the overall quality of vocoder-generated speech, and that the best vocoder can vary from voice to voice. The single best-rated TTS system used the glottal vocoder GlottDNN with a male voice of low expressiveness, but the sinusoidal vocoder PML (pulse model in log-domain) showed the best overall performance across the performed tests. Finally, when the spectral models of the vocoders were controlled for, the observed differences were similar to the baseline results, indicating that the waveform generation method of a vocoder is essential for quality improvements.
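The analysis–synthesis loop described above can be illustrated with a deliberately minimal source-filter vocoder: per-frame linear prediction (LPC) captures a spectral envelope and gain, and synthesis excites each frame's all-pole filter with a fixed-F0 impulse train. This sketch is purely illustrative and is not any of the vocoders compared in the study (the frame length, LPC order, and constant F0 are arbitrary assumptions; a real vocoder would also track F0 and model aperiodicity).

```python
import numpy as np
from scipy.signal import lfilter

def lpc(frame, order):
    """LPC coefficients via the autocorrelation method (Levinson-Durbin)."""
    n = len(frame)
    r = np.correlate(frame, frame, mode="full")[n - 1:n + order]
    a = np.array([1.0])
    err = r[0]
    for i in range(1, order + 1):
        # Reflection coefficient from the current prediction error
        k = -(r[i] + np.dot(a[1:], r[i - 1:0:-1])) / err
        a = np.concatenate([a, [0.0]])
        a = a + k * a[::-1]          # Levinson-Durbin order update
        err *= 1.0 - k * k
    return a, err

def analyze(x, frame_len, order):
    """Analysis: a crude parametric representation (envelope + gain per frame)."""
    params = []
    for start in range(0, len(x) - frame_len + 1, frame_len):
        frame = x[start:start + frame_len] * np.hanning(frame_len)
        a, err = lpc(frame, order)
        params.append((a, np.sqrt(max(err, 1e-12))))
    return params

def synthesize(params, fs, frame_len, f0):
    """Synthesis: filter a fixed-F0 impulse train through each frame's envelope."""
    period = int(fs / f0)
    out = []
    for a, gain in params:
        excitation = np.zeros(frame_len)
        excitation[::period] = 1.0   # naive voiced excitation
        out.append(lfilter([gain], a, excitation))
    return np.concatenate(out)
```

Because the excitation is a bare impulse train, the resynthesized speech sounds buzzy; the quality differences studied in the paper stem largely from how each vocoder replaces this naive excitation with a more realistic waveform generation method.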
