The NU-NAIST Voice Conversion System for the Voice Conversion Challenge 2016

This paper presents the NU-NAIST voice conversion (VC) system for the Voice Conversion Challenge 2016 (VCC 2016), developed by a joint team of Nagoya University and Nara Institute of Science and Technology. Statistical VC based on a Gaussian mixture model makes it possible to convert the speaker identity of a source speaker's voice into that of a target speaker by converting several speech parameters. However, various factors such as parameterization errors and over-smoothing effects usually degrade the speech quality of the converted voice. To address this issue, we have proposed a direct waveform modification technique based on spectral differential filtering and have successfully applied it to singing voice conversion, where excitation features need not be converted. In this paper, we propose a method to apply this technique to a standard voice conversion task where excitation feature conversion is needed. The results of VCC 2016 demonstrate that the NU-NAIST VC system developed with the proposed method yields the best conversion accuracy for speaker identity (a correct identification rate above 70%) and quite high naturalness (a mean opinion score above 3). This paper presents a detailed description of the NU-NAIST VC system and additional results of its performance evaluation.
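The core idea of direct waveform modification based on spectral differential filtering is to skip vocoder-based resynthesis: instead of generating a converted waveform from converted parameters, the source waveform itself is filtered with the estimated spectral *difference* between target and source. The sketch below is a minimal illustration of this idea, not the authors' implementation (which uses mel-cepstral differential filtering rather than STFT-domain gains): a per-frame log-spectral differential, assumed here to be given, is applied as a zero-phase gain to the source waveform via STFT multiplication and overlap-add resynthesis.

```python
# Illustrative sketch (NOT the authors' system): direct waveform modification
# by filtering the source waveform with a per-frame spectral differential.
# Assumptions: the log-spectral differential (target minus source, per frame
# and frequency bin) has already been estimated, e.g. by a statistical model.
import numpy as np

FRAME_LEN, HOP = 512, 128  # Hann analysis window, 75% overlap


def stft(x):
    """Short-time Fourier transform with a Hann window."""
    window = np.hanning(FRAME_LEN)
    n_frames = 1 + (len(x) - FRAME_LEN) // HOP
    frames = np.stack(
        [x[i * HOP:i * HOP + FRAME_LEN] * window for i in range(n_frames)]
    )
    return np.fft.rfft(frames, axis=1)


def istft(spec, length=None):
    """Inverse STFT via weighted overlap-add with per-sample normalization."""
    window = np.hanning(FRAME_LEN)
    frames = np.fft.irfft(spec, n=FRAME_LEN, axis=1)
    out = np.zeros((frames.shape[0] - 1) * HOP + FRAME_LEN)
    norm = np.zeros_like(out)
    for i, frame in enumerate(frames):
        out[i * HOP:i * HOP + FRAME_LEN] += frame * window
        norm[i * HOP:i * HOP + FRAME_LEN] += window ** 2
    out /= np.maximum(norm, 1e-8)
    return out[:length] if length is not None else out


def differential_filter(x, log_diff):
    """Convert the waveform directly: apply exp(log_diff) as a zero-phase
    per-frame spectral gain, keeping the source phase and excitation."""
    spec = stft(x)                      # (n_frames, FRAME_LEN // 2 + 1)
    converted = spec * np.exp(log_diff)  # log-domain differential -> gain
    return istft(converted, length=len(x))
```

Because the source phase and excitation are carried over unchanged, a zero differential leaves the waveform intact, which is why this approach avoids the parameterization errors a vocoder would introduce.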
