Exemplar-Based Sparse Representation With Residual Compensation for Voice Conversion

We propose a nonparametric framework for voice conversion: exemplar-based sparse representation with residual compensation. In this framework, a spectrogram is reconstructed as a weighted linear combination of speech segments, called exemplars, each spanning multiple consecutive frames. The combination weights are constrained to be sparse to avoid over-smoothing, and high-resolution spectra are used directly in the exemplars, without dimensionality reduction, to preserve spectral detail. In addition, a spectral compression factor and a residual compensation technique are incorporated into the framework to further improve conversion performance. We conducted experiments on the VOICES database to compare the proposed method with a set of state-of-the-art baselines, including the maximum-likelihood Gaussian mixture model (ML-GMM) with the dynamic feature constraint and partial least squares (PLS) regression-based methods. The results show that the proposed method reduces the objective spectral distortion of ML-GMM from 5.19 dB to 4.92 dB and raises the subjective mean opinion score and the speaker identification rate from 2.49 and 73.50% to 3.15 and 79.50%, respectively. The results also show that our method outperforms the PLS-based methods. Subjective listening tests further indicate that the naturalness of speech converted by the proposed method is comparable with that of the ML-GMM method with the global variance constraint.
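To make the exemplar-based reconstruction concrete, the sketch below estimates sparse non-negative activations over a source exemplar dictionary and applies the same activations to the paired target dictionary. It is a minimal illustration, not the paper's implementation: the function names, the KL-divergence multiplicative updates, and the L1 sparsity weight are assumptions, and the spectral compression factor and residual compensation steps described above are omitted.

```python
import numpy as np

def stack_frames(S, context=2):
    """Stack each frame of spectrogram S (shape d x T) with its +/- context
    neighbours, so each column becomes an exemplar spanning
    (2 * context + 1) consecutive frames."""
    d, T = S.shape
    padded = np.pad(S, ((0, 0), (context, context)), mode="edge")
    return np.vstack([padded[:, i:i + T] for i in range(2 * context + 1)])

def convert_spectrogram(X_src, A_src, B_tgt, sparsity=0.1, n_iter=200, eps=1e-12):
    """Estimate sparse non-negative activations H with X_src ~ A_src @ H,
    then map them onto the paired target dictionary: Y = B_tgt @ H.

    X_src : (d, T)  source magnitude spectrogram to convert
    A_src : (d, N)  source exemplar dictionary (columns are exemplars)
    B_tgt : (d2, N) target exemplars aligned one-to-one with A_src

    Hypothetical parameter values; the sparsity weight and iteration
    count are illustrative, not taken from the paper.
    """
    rng = np.random.default_rng(0)
    d, T = X_src.shape
    H = rng.random((A_src.shape[1], T)) + eps  # non-negative initialisation

    ones = np.ones((d, T))
    for _ in range(n_iter):
        # Multiplicative update for KL-divergence NMF with an L1 penalty
        # on H (fixed dictionary); the penalty keeps activations sparse.
        H *= (A_src.T @ (X_src / (A_src @ H + eps))) / (A_src.T @ ones + sparsity + eps)

    return B_tgt @ H
```

In a typical use of this sketch, A_src would be built with stack_frames from source training spectra and B_tgt from time-aligned target spectra of a parallel corpus; the converted spectrogram B_tgt @ H would then be passed to a vocoder such as STRAIGHT for waveform reconstruction.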
