Voice Transformation Using Two-Level Dynamic Warping

Voice transformation, for example from a male speaker to a female speaker, is achieved here using a two-level dynamic warping algorithm in conjunction with an artificial neural network. An outer warping process that temporally aligns blocks of speech (dynamic time warp, DTW) invokes an inner warping process that aligns their magnitude spectra (dynamic frequency warp, DFW). The mapping function produced by the inner dynamic frequency warp is used to move spectral information from the source speaker to the target speaker. Artifacts arising from this magnitude-spectrum mapping are reduced by reconstructing phase information. The alignments obtained by this process are then used to train an artificial neural network to produce spectral warping functions from spectral input data. The performance of the speech mapping is compared with previous voice transformation research using Mel-Cepstral Distortion (MCD), and it is shown to perform better than other methods, based on their reported MCD scores.
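
The two-level structure lends itself to a nested dynamic-programming sketch. The following is a minimal illustration, not the paper's implementation: the local cost (absolute magnitude difference), the symmetric step pattern, and the function names (dtw_matrix, backtrack, dfw, two_level_warp) are all assumptions made for the example.

```python
import numpy as np

def dtw_matrix(local_cost):
    """Accumulate a standard DTW cost matrix over a grid of local costs."""
    n, m = local_cost.shape
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i, j] = local_cost[i - 1, j - 1] + min(
                D[i - 1, j - 1], D[i - 1, j], D[i, j - 1]
            )
    return D

def backtrack(D):
    """Recover the optimal warping path from an accumulated cost matrix."""
    i, j = D.shape[0] - 1, D.shape[1] - 1
    path = []
    while i > 1 or j > 1:
        path.append((i - 1, j - 1))
        i, j = min([(i - 1, j - 1), (i - 1, j), (i, j - 1)],
                   key=lambda t: D[t])
    path.append((0, 0))
    return path[::-1]

def dfw(src_frame, tgt_frame):
    """Inner warp: align the frequency bins of two magnitude spectra.
    Returns the alignment cost and the bin-to-bin mapping path."""
    local_cost = np.abs(src_frame[:, None] - tgt_frame[None, :])
    D = dtw_matrix(local_cost)
    return D[-1, -1], backtrack(D)

def two_level_warp(src_spec, tgt_spec):
    """Outer warp: time-align source and target frames, scoring each
    frame pair by its inner DFW cost. Inputs: (frames, bins) arrays.
    Returns the frame alignment and the per-pair frequency mappings."""
    n, m = len(src_spec), len(tgt_spec)
    local_cost = np.empty((n, m))
    freq_maps = {}
    for i in range(n):
        for j in range(m):
            local_cost[i, j], freq_maps[i, j] = dfw(src_spec[i], tgt_spec[j])
    frame_path = backtrack(dtw_matrix(local_cost))
    return frame_path, freq_maps
```

This brute-force version runs an inner warp for every frame pair, so its cost grows as O(N·M·B²) for N and M frames of B bins each; a practical implementation would constrain the search to a band around the diagonal. The bin-to-bin paths returned for the aligned frame pairs play the role of the mapping functions the abstract describes, and pairs of source spectrum and mapping function would then serve as training data for the neural network.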

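The abstract reduces the artifacts of magnitude-only mapping by reconstructing phase. A standard choice for this is Griffin and Lim's iterative estimation of a phase consistent with a modified short-time Fourier transform; treating that as the method used here is an assumption, as are the parameters in this sketch, which relies on librosa's implementation.

```python
import librosa

def reconstruct_waveform(warped_mag, n_iter=60, hop_length=256):
    """Iteratively estimate a phase consistent with the warped magnitude
    spectrogram and invert it to a waveform. Note that librosa expects
    the spectrogram shaped (bins, frames), so transpose (frames, bins)
    data before calling. Parameter values here are illustrative."""
    return librosa.griffinlim(warped_mag, n_iter=n_iter, hop_length=hop_length)
```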