Phone-aware LSTM-RNN for voice conversion

This paper investigates a new voice conversion technique using phone-aware Long Short-Term Memory Recurrent Neural Networks (LSTM-RNNs). Most existing voice conversion methods, including Joint Density Gaussian Mixture Models (JDGMMs), Deep Neural Networks (DNNs) and Bidirectional Long Short-Term Memory Recurrent Neural Networks (BLSTM-RNNs), use only the acoustic information of speech as features to train models. We propose to incorporate linguistic information into the voice conversion system by using monophones generated by a speech recognizer as linguistic features. The monophones and spectral features are combined to train LSTM-RNN based voice conversion models, reinforcing the context-dependency modelling of LSTM-RNNs. The results of the 1st Voice Conversion Challenge show that our system achieves significantly higher performance than the baseline (GMM method) and ranks among the most competitive systems in the similarity test. In addition, the experimental results show that the phone-aware LSTM-RNN method obtains lower Mel-cepstral distortion and higher MOS scores than the baseline LSTM-RNNs.
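As a rough illustration of the idea of combining monophones with spectral features, the sketch below appends a one-hot monophone vector to each spectral frame and computes a commonly used form of Mel-cepstral distortion. The one-hot encoding, the function names, and the exact distortion formula are assumptions made for illustration; the abstract does not specify how the features are encoded or how the distortion is computed, and the LSTM-RNN itself is not shown.

```python
import math


def one_hot(index, size):
    """Return a one-hot list of length `size` with a 1.0 at `index`."""
    vec = [0.0] * size
    vec[index] = 1.0
    return vec


def phone_aware_features(spectral_frames, phone_ids, num_phones):
    """Append a one-hot monophone vector to every spectral frame.

    Assumption: the recognizer yields one monophone id per frame, and
    one-hot concatenation is used; the paper may encode phones differently.
    """
    if len(spectral_frames) != len(phone_ids):
        raise ValueError("need exactly one phone id per spectral frame")
    return [
        list(frame) + one_hot(pid, num_phones)
        for frame, pid in zip(spectral_frames, phone_ids)
    ]


def mel_cepstral_distortion(target, converted):
    """Mean Mel-cepstral distortion in dB over time-aligned frames.

    Uses the common form (10 / ln 10) * sqrt(2 * sum_d (c_d - c'_d)^2),
    assuming the energy coefficient has already been removed.
    """
    if not target or len(target) != len(converted):
        raise ValueError("frame sequences must be non-empty and aligned")
    const = 10.0 / math.log(10.0) * math.sqrt(2.0)
    total = 0.0
    for t, c in zip(target, converted):
        total += const * math.sqrt(sum((a - b) ** 2 for a, b in zip(t, c)))
    return total / len(target)


# Hypothetical toy data: two frames of 2-dimensional spectral features,
# with a 3-symbol monophone inventory.
frames = [[0.1, 0.2], [0.3, 0.4]]
inputs = phone_aware_features(frames, phone_ids=[2, 0], num_phones=3)
```

Each resulting input vector has the spectral dimensions followed by the monophone dimensions, so a sequence model trained on it sees both acoustic and linguistic context at every frame.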
