Voice conversion based on empirical conditional distribution in resource-limited scenarios

This paper presents a computationally efficient voice conversion system designed for resource-limited scenarios. First, mixtures of Gaussians (MoGs) placed at fixed Mel-frequency locations are used to represent the STRAIGHT spectrum compactly. Second, the key conditional distributions needed for prediction are approximated empirically by building histograms of aligned features. Experiments confirm that the proposed method achieves fairly good results compared with the traditional method, without incurring a large computational cost.
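To make the second idea concrete, the following is a minimal sketch of how a conditional distribution can be estimated from histograms of aligned feature pairs. It is not the authors' implementation: the scalar features, the uniform binning, the bin count, the uniform fallback for unseen source bins, and the conditional-mean predictor are all assumptions made for illustration, and the function names `fit_conditional_histogram` and `convert` are hypothetical.

```python
import numpy as np


def fit_conditional_histogram(src, tgt, n_bins=8, lo=0.0, hi=1.0):
    """Estimate P(target bin | source bin) from aligned scalar feature pairs.

    Assumption: features are scalars in [lo, hi] and are binned uniformly.
    """
    edges = np.linspace(lo, hi, n_bins + 1)
    # Map each value to a bin index in [0, n_bins - 1].
    s_idx = np.clip(np.digitize(src, edges) - 1, 0, n_bins - 1)
    t_idx = np.clip(np.digitize(tgt, edges) - 1, 0, n_bins - 1)

    # Joint histogram of (source bin, target bin) co-occurrences.
    counts = np.zeros((n_bins, n_bins))
    np.add.at(counts, (s_idx, t_idx), 1.0)

    # Normalize each row to get a conditional distribution.
    # Assumption: source bins never observed fall back to uniform.
    row_sums = counts.sum(axis=1, keepdims=True)
    cond = np.where(row_sums > 0,
                    counts / np.maximum(row_sums, 1.0),
                    1.0 / n_bins)
    centers = 0.5 * (edges[:-1] + edges[1:])
    return edges, centers, cond


def convert(src, edges, centers, cond):
    """Predict the conditional mean of the target for each source value."""
    n_bins = len(centers)
    s_idx = np.clip(np.digitize(src, edges) - 1, 0, n_bins - 1)
    return cond[s_idx] @ centers
```

Once the table is built, conversion reduces to one bin lookup and one small dot product per feature, which is the kind of cost profile that suits a resource-limited device. How the paper bins multi-dimensional features or selects a prediction from each histogram is not stated in the abstract, so those details are left open here.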
