Reducing Mismatch in Training of DNN-Based Glottal Excitation Models in a Statistical Parametric Text-to-Speech System

Neural network-based models that generate glottal excitation waveforms from acoustic features have been shown to improve quality in statistical parametric speech synthesis. Until now, however, these models have been trained separately from the acoustic model. This creates a mismatch between training and synthesis, as the synthesized acoustic features used as excitation model inputs at synthesis time differ from the natural features on which the model was trained. Furthermore, because of errors in predicting the vocal tract filter, the original excitation waveforms would not reconstruct the speech waveform perfectly even if the excitation model predicted them without error. To address these issues and to make the excitation model more robust against errors in acoustic modeling, this paper proposes two modifications to the excitation model training scheme. First, the excitation model is trained in a connected manner, with inputs generated by the acoustic model. Second, the target glottal waveforms are re-estimated by performing glottal inverse filtering with the predicted vocal tract filters. The results show that both modifications improve objective performance in terms of mean squared error (MSE) and MFCC distortion, and slightly improve the subjective quality of the synthetic speech.
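To make the two modifications concrete, the following is a minimal sketch, not the paper's actual implementation. It assumes an all-pole vocal tract model 1/A(z) parameterized by LPC coefficients, so glottal inverse filtering amounts to applying the FIR filter A(z) to the speech signal; the callables `acoustic_model` and `features_to_lpc` are hypothetical stand-ins for the trained acoustic model and for the mapping from acoustic features to filter coefficients.

```python
# Sketch of the two training-scheme modifications under the stated
# assumptions (all-pole vocal tract, LPC coefficients a_pred).
import numpy as np
from scipy.signal import lfilter

def reestimate_glottal_target(speech_frame, a_pred):
    """Modification 2: re-estimate the target glottal waveform by inverse
    filtering natural speech with the *predicted* vocal tract filter,
    i.e. applying the FIR filter A(z), so the targets absorb errors in
    the predicted filter."""
    return lfilter(a_pred, [1.0], speech_frame)

def make_connected_training_pair(linguistic_feats, speech_frame,
                                 acoustic_model, features_to_lpc):
    """Modification 1: condition the excitation model on acoustic features
    *generated by the acoustic model*, as at synthesis time, instead of
    features extracted from natural speech.

    `acoustic_model` maps linguistic features to acoustic features and
    `features_to_lpc` extracts LPC coefficients from them; both are
    hypothetical helpers, not names from the paper.
    """
    ac_feats_pred = acoustic_model(linguistic_feats)   # synthesis-time input
    a_pred = features_to_lpc(ac_feats_pred)            # predicted filter
    glottal_target = reestimate_glottal_target(speech_frame, a_pred)
    return ac_feats_pred, glottal_target
```

With these pairs, the excitation model sees inputs from the same distribution at training and synthesis time, and the re-estimated targets, when passed back through the predicted filter 1/A(z), reconstruct the natural speech exactly (up to numerical precision).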
