Voice Conversion Using Dynamic Frequency Warping With Amplitude Scaling, for Parallel or Nonparallel Corpora

In Voice Conversion (VC), the speech of a source speaker is modified to resemble that of a particular target speaker. Currently, standard VC approaches use Gaussian mixture model (GMM)-based transformations that do not generate high-quality converted speech due to “over-smoothing” resulting from weak links between individual source and target frame parameters. Dynamic Frequency Warping (DFW) offers an appealing alternative to GMM-based methods, as more spectral details are maintained in transformation; however, the speaker timbre is less successfully converted because spectral power is not adjusted explicitly. Previous work combines separate GMM- and DFW-transformed spectral envelopes for each frame. This paper proposes a more effective DFW-based approach that (1) does not rely on the baseline GMM methods, and (2) functions on the acoustic class level. To adjust spectral power, an amplitude scaling function is used that compares the average target and warped source log spectra for each acoustic class. The proposed DFW with Amplitude scaling (DFWA) outperforms standard GMM and hybrid GMM-DFW methods for VC in terms of both speech quality and timbre conversion, as is confirmed in extensive objective and subjective testing. Furthermore, by not requiring time-alignment of source and target speech, DFWA is able to perform equally well using parallel or nonparallel corpora, as is demonstrated explicitly.
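The two core ingredients described above — warping a source spectrum onto a target frequency axis, and a class-level amplitude scaling computed as the difference between the average target and average warped-source log spectra — can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: it uses a plain DTW over frequency bins as the warping criterion and hypothetical helper names (`dfw_path`, `warp_frame`, `amplitude_scaling`), and it ignores details such as the acoustic-class assignment and the actual spectral distance used in the paper.

```python
import numpy as np

def dfw_path(src_frame, tgt_frame):
    """Dynamic frequency warping: DTW alignment over the frequency
    bins of two log-spectral frames (squared-difference local cost)."""
    n, m = len(src_frame), len(tgt_frame)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = (src_frame[i - 1] - tgt_frame[j - 1]) ** 2
            cost[i, j] = d + min(cost[i - 1, j - 1],
                                 cost[i - 1, j],
                                 cost[i, j - 1])
    # Backtrack the lowest-cost warping path.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = int(np.argmin([cost[i - 1, j - 1],
                              cost[i - 1, j],
                              cost[i, j - 1]]))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]

def warp_frame(src_frame, path, n_bins):
    """Map source bins onto the target frequency axis along the DFW
    path, averaging when several source bins land on one target bin."""
    warped = np.zeros(n_bins)
    counts = np.zeros(n_bins)
    for i, j in path:
        warped[j] += src_frame[i]
        counts[j] += 1
    return warped / np.maximum(counts, 1)

def amplitude_scaling(warped_src_frames, tgt_frames):
    """Class-level amplitude correction: the difference between the
    average target and average warped-source log spectra. Added to a
    warped frame, it restores the target's spectral power."""
    return np.mean(tgt_frames, axis=0) - np.mean(warped_src_frames, axis=0)
```

Because the scaling term is an average over all frames of an acoustic class rather than a per-frame correction, no frame-by-frame time alignment between source and target utterances is needed, which is what allows the approach to work with nonparallel corpora.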
