Voice conversion using deep Bidirectional Long Short-Term Memory based Recurrent Neural Networks

This paper investigates the use of Deep Bidirectional Long Short-Term Memory based Recurrent Neural Networks (DBLSTM-RNNs) for voice conversion. Frame-based methods using conventional Deep Neural Networks (DNNs) do not directly model temporal correlations across speech frames, which limits the quality of the converted speech. To improve the naturalness and continuity of the converted speech, we propose a sequence-based conversion method using DBLSTM-RNNs that models not only the frame-wise relationship between the source and target voices, but also the long-range context dependencies in the acoustic trajectory. Experiments show that the DBLSTM-RNN outperforms the DNN, with Mean Opinion Scores of 3.2 and 2.3, respectively. Moreover, the DBLSTM-RNN without dynamic features still outperforms the DNN with dynamic features.

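To make the sequence-based mapping concrete, the sketch below shows how such a conversion model could be assembled from stacked bidirectional LSTM layers in PyTorch. This is a minimal illustration, not the authors' implementation: the class name, feature dimensionality (e.g. 25 Mel-cepstral coefficients per frame), hidden size, and layer count are illustrative assumptions.

```python
# Minimal sketch (assumed architecture, not the paper's exact setup) of a
# deep bidirectional LSTM that maps a source speaker's frame sequence to
# the target speaker's acoustic features, one output frame per input frame.
import torch
import torch.nn as nn

class DBLSTMConverter(nn.Module):
    def __init__(self, feat_dim=25, hidden_size=128, num_layers=3):
        super().__init__()
        # Stacked bidirectional LSTM: each frame's hidden state sees the
        # full past and future context of the utterance, capturing the
        # long-range dependencies in the acoustic trajectory.
        self.blstm = nn.LSTM(
            input_size=feat_dim,
            hidden_size=hidden_size,
            num_layers=num_layers,
            batch_first=True,
            bidirectional=True,
        )
        # Project the concatenated forward/backward states to the target
        # speaker's spectral features.
        self.out = nn.Linear(2 * hidden_size, feat_dim)

    def forward(self, src):        # src: (batch, frames, feat_dim)
        h, _ = self.blstm(src)     # h:   (batch, frames, 2 * hidden_size)
        return self.out(h)         # out: (batch, frames, feat_dim)

# Training would pair time-aligned source/target utterances (e.g. aligned
# with DTW) and minimize a frame-wise loss such as MSE over the sequence.
model = DBLSTMConverter()
src = torch.randn(1, 200, 25)      # one 200-frame source utterance
tgt = torch.randn(1, 200, 25)      # aligned target features (placeholder)
loss = nn.functional.mse_loss(model(src), tgt)
```

Because the whole utterance is processed at once, smoothness along the trajectory is learned by the recurrent layers themselves, which is why explicit dynamic (delta) features are less critical than in frame-based DNN conversion.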