Effective Spectral and Excitation Modeling Techniques for LSTM-RNN-Based Speech Synthesis Systems

In this paper, we report research results on modeling the parameters of an improved time-frequency trajectory excitation (ITFTE) and the spectral envelopes of an LPC vocoder with a long short-term memory (LSTM)-based recurrent neural network (RNN) for high-quality text-to-speech (TTS) systems. In our prior work, the ITFTE vocoder significantly improved the perceptual quality of statistical parametric TTS systems. However, a simple feed-forward deep neural network (DNN) with a finite window length is inadequate to capture the time evolution of the ITFTE parameters. We propose to use an LSTM to exploit the time-varying nature of both the excitation and filter parameter trajectories, where the LSTM takes linguistic text input and predicts the ITFTE and LPC parameters holistically. For the LPC parameters, we further enhance the generated spectrum by applying LP bandwidth expansion and line spectral frequency (LSF)-sharpening filters. These filters not only reduce unstable synthesis filter conditions but also mitigate the muffled quality of the generated spectrum. Experimental results show that the proposed LSTM-RNN system with the ITFTE vocoder significantly outperforms both a similarly configured band-aperiodicity-based system and our best prior DNN-trained counterpart, objectively and subjectively.
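The LP bandwidth expansion mentioned above has a standard closed form: scaling the k-th LPC coefficient by gamma^k (0 < gamma < 1) moves the poles of the synthesis filter 1/A(z) radially toward the origin, widening formant bandwidths and keeping near-unit-circle poles away from instability. A minimal sketch of this operation; the function name and gamma value are illustrative, not taken from the paper:

```python
import numpy as np

def lp_bandwidth_expansion(lpc, gamma=0.98):
    """Replace A(z) with A(z/gamma) by scaling the k-th LPC
    coefficient by gamma**k. This moves each pole of 1/A(z)
    radially inward by a factor of gamma."""
    lpc = np.asarray(lpc, dtype=float)
    return lpc * gamma ** np.arange(len(lpc))

# Example: 2nd-order predictor A(z) = 1 - 1.8 z^-1 + 0.9 z^-2,
# whose complex pole pair has radius sqrt(0.9) ~= 0.949.
a = np.array([1.0, -1.8, 0.9])
a_exp = lp_bandwidth_expansion(a, gamma=0.98)

# Pole radii shrink by exactly gamma, so the expanded filter
# stays safely inside the unit circle.
print(np.abs(np.roots(a)).max(), np.abs(np.roots(a_exp)).max())
```

Because the pole radii scale by exactly gamma, the trade-off is a single scalar: values of gamma close to 1 barely change the spectrum, while smaller values smooth formant peaks more aggressively.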
