F0 transformation techniques for statistical voice conversion with direct waveform modification with spectral differential

This paper presents several F0 transformation techniques for statistical voice conversion (VC) with direct waveform modification with spectral differential (DIFFVC). Statistical VC is a technique that converts the speaker identity of a source speaker's voice into that of a target speaker by converting several acoustic features, such as spectral and excitation features. This technique usually uses a vocoder to generate converted speech waveforms from the converted acoustic features. However, the use of a vocoder often degrades the speech quality of the converted voice owing to insufficient parameterization accuracy. To avoid this issue, we have proposed a direct waveform modification technique based on spectral differential filtering and have successfully applied it to intra-gender singing VC (DIFFSVC), where conversion of excitation features is unnecessary. Moreover, we have also applied it to cross-gender singing VC by implementing F0 transformation at a constant rate, such as a one-octave increase or decrease. On the other hand, it is not straightforward to apply the DIFFSVC framework to conversion of normal speech because the F0 transformation ratio varies widely depending on the combination of source and target speakers. In this paper, we propose several F0 transformation techniques for DIFFVC and compare their performance in terms of the speech quality of the converted voice and the conversion accuracy of speaker individuality. The experimental results demonstrate that the F0 transformation technique based on waveform modification achieves the best performance among the proposed techniques.
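
As an illustration of the spectral-differential filtering idea underlying DIFFVC/DIFFSVC, the following minimal NumPy sketch applies a per-frame differential amplitude spectrum directly to the source waveform by short-time filtering and overlap-add, so that no vocoder resynthesis is needed. The function name, its arguments, and the STFT-based realization are illustrative assumptions for this sketch; the actual systems typically realize the differential filter in the mel-cepstral domain (e.g., with an MLSA filter driven by the differential mel-cepstrum).

```python
# Minimal sketch (assumed interface): directly modify a waveform with a
# per-frame spectral differential, i.e., the ratio |S_converted|/|S_source|.
import numpy as np

def apply_spectral_differential(x, diff_amp, frame_len=1024, hop=256):
    """Filter waveform x frame by frame with per-frame amplitude ratios.

    x        : 1-D source waveform (float array).
    diff_amp : (n_frames, frame_len // 2 + 1) amplitude ratios
               |S_converted| / |S_source| for each analysis frame.
    Returns the directly modified waveform (no vocoder resynthesis).
    """
    window = np.hanning(frame_len)
    n_frames = diff_amp.shape[0]
    y = np.zeros(len(x) + frame_len)
    norm = np.zeros(len(x) + frame_len)
    for i in range(n_frames):
        start = i * hop
        frame = x[start:start + frame_len]
        if len(frame) < frame_len:  # zero-pad the last partial frame
            frame = np.pad(frame, (0, frame_len - len(frame)))
        spec = np.fft.rfft(frame * window)
        # Scale only the amplitude of the source spectrum, keeping the source
        # phase: a zero-phase differential filter applied to the waveform.
        spec *= diff_amp[i]
        out = np.fft.irfft(spec) * window
        y[start:start + frame_len] += out
        norm[start:start + frame_len] += window ** 2
    norm[norm < 1e-8] = 1.0  # avoid division by zero at the edges
    return (y / norm)[:len(x)]
```

For the cross-gender case mentioned above, a constant-ratio F0 transformation such as a one-octave shift corresponds to scaling F0 by 2.0 (increase) or 0.5 (decrease); in the waveform domain this is commonly approximated by combining resampling with time-scale modification (e.g., WSOLA) so that pitch changes while duration is preserved.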
