Unidirectional long short-term memory recurrent neural network with recurrent output layer for low-latency speech synthesis

Long short-term memory recurrent neural networks (LSTM-RNNs) have been applied to various speech applications, including acoustic modeling for statistical parametric speech synthesis. One concern in applying them to text-to-speech applications is their effect on latency. To address this concern, this paper proposes a low-latency, streaming speech synthesis architecture using unidirectional LSTM-RNNs with a recurrent output layer. Using a unidirectional RNN architecture allows frame-synchronous streaming inference of output acoustic features given input linguistic features. The recurrent output layer further encourages smooth transitions between acoustic features at consecutive frames. Experimental results from subjective listening tests show that the proposed architecture can synthesize natural-sounding speech without requiring utterance-level batch processing.
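The frame-synchronous behavior described above can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes a single LSTM layer, a linear output layer with an extra connection from the previous output frame, random untrained weights, and toy dimensions. The names `StreamingSynthesizer`, `step`, and `stream` are hypothetical.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(W, v):
    # Multiply a matrix (list of rows) by a vector.
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def rand_matrix(rows, cols, rng, scale=0.1):
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


class StreamingSynthesizer:
    """Toy unidirectional LSTM layer followed by a recurrent output layer."""

    def __init__(self, n_in, n_hidden, n_out, seed=0):
        rng = random.Random(seed)
        # One weight matrix per gate: input (i), forget (f), candidate (g),
        # output (o). Each acts on the concatenation [x_t, h_{t-1}].
        self.W = {k: rand_matrix(n_hidden, n_in + n_hidden, rng) for k in "ifgo"}
        self.b = {k: [0.0] * n_hidden for k in "ifgo"}
        self.W_hy = rand_matrix(n_out, n_hidden, rng)  # hidden -> output
        self.W_yy = rand_matrix(n_out, n_out, rng)     # previous output -> output
        self.b_y = [0.0] * n_out
        # State carried from frame to frame; no future frames are needed.
        self.h = [0.0] * n_hidden
        self.c = [0.0] * n_hidden
        self.y = [0.0] * n_out

    def step(self, x):
        """Consume one frame of linguistic features, emit one acoustic frame."""
        z = list(x) + self.h
        pre = {k: [a + b for a, b in zip(matvec(self.W[k], z), self.b[k])]
               for k in "ifgo"}
        i = [sigmoid(v) for v in pre["i"]]
        f = [sigmoid(v) for v in pre["f"]]
        g = [math.tanh(v) for v in pre["g"]]
        o = [sigmoid(v) for v in pre["o"]]
        self.c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, self.c, i, g)]
        self.h = [ok * math.tanh(ck) for ok, ck in zip(o, self.c)]
        # Assumed form of the recurrent output layer:
        # y_t = W_hy h_t + W_yy y_{t-1} + b_y
        hy = matvec(self.W_hy, self.h)
        yy = matvec(self.W_yy, self.y)
        self.y = [a + b + c for a, b, c in zip(hy, yy, self.b_y)]
        return self.y


def stream(model, frames):
    # Yield each output frame as soon as its input frame arrives.
    for x in frames:
        yield list(model.step(x))
```

Because each output depends only on the current input and the carried state, the outputs for the first few frames are the same whether or not later frames are ever supplied. That property is what removes the need for utterance-level batch processing; a bidirectional network would not have it.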
