Voice Morphing that improves TTS quality using an optimal dynamic frequency warping-and-weighting transform

Dynamic Frequency Warping (DFW) is widely used to align the spectra of different speakers. It has long been argued that frequency warping captures inter-speaker differences, but DFW in practice always requires a tricky preprocessing step to remove spectral tilt. The DFW residual has been used successfully in voice morphing to improve the quality and the similarity of synthesized speech, but its estimation remains largely heuristic and sub-optimal. This paper presents a dynamic programming algorithm that jointly estimates the Optimal Dynamic Frequency Warping and Weighting (ODFWW) transform and therefore needs no preprocessing step or fine-tuning, while source/target-speaker data are matched using the Matching-Minimization algorithm [4]. The transform is used to morph the output of a state-of-the-art Vocaine-based [18] TTS synthesizer in order to generate different voices at runtime with only +8% computational overhead. Some morphed TTS voices exhibit significantly higher quality than the original one, as morphing seems to "correct" the voice characteristics of the TTS voice.
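To make the underlying idea concrete, the classical DFW step that the paper builds on can be sketched as a dynamic-programming alignment over frequency bins, analogous to dynamic time warping [23] but applied along the frequency axis of two spectral envelopes. The sketch below is a plain illustrative implementation of that baseline idea, not the paper's joint warping-and-weighting (ODFWW) estimator; the function name and the squared-difference local cost are assumptions for illustration.

```python
import numpy as np

def dfw_align(source_env, target_env):
    """Classical dynamic frequency warping (illustrative sketch):
    align two log-spectral envelopes with dynamic programming and
    return (total cost, monotonic warping path over frequency bins).

    Note: this is plain DTW along frequency, not the ODFWW transform
    proposed in the paper, which also estimates spectral weights.
    """
    n, m = len(source_env), len(target_env)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = (source_env[i - 1] - target_env[j - 1]) ** 2
            cost[i, j] = d + min(cost[i - 1, j],      # advance source bin only
                                 cost[i, j - 1],      # advance target bin only
                                 cost[i - 1, j - 1])  # advance both (diagonal)
    # Backtrack the minimum-cost monotonic warping path.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return cost[n, m], path[::-1]
```

For identical envelopes the optimal path is the diagonal with zero cost; the practical difficulty the paper addresses is that on real spectra this raw alignment is dominated by spectral tilt unless the envelopes are preprocessed, which ODFWW avoids by estimating warping and weighting jointly.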

[1] Hui Liang, et al. VTLN-Based Rapid Cross-Lingual Adaptation for Statistical Parametric Speech Synthesis, 2012.

[2] Orhan Karaali, et al. Speech Synthesis with Neural Networks, 1998, ArXiv.

[3] Thomas Quatieri, et al. Discrete-Time Speech Signal Processing: Principles and Practice, 2001.

[4] Yannis Agiomyrgiannakis. The matching-minimization algorithm, the INCA algorithm and a mathematical framework for voice conversion with unaligned corpora, 2016, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[5] Moncef Gabbouj, et al. Voice conversion for non-parallel datasets using dynamic kernel partial least squares regression, 2013, INTERSPEECH.

[6] Heiga Zen, et al. Statistical Parametric Speech Synthesis, 2007, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[7] Inma Hernáez, et al. Towards Physically Interpretable Parametric Voice Conversion Functions, 2013, NOLISP.

[8] Hideki Kawahara, et al. STRAIGHT, exploitation of the other aspect of VOCODER: Perceptually isomorphic decomposition of speech sounds, 2006.

[9] Keiichi Tokuda, et al. Spectral conversion based on maximum likelihood estimation considering global variance of converted parameter, 2005, IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP).

[10] Tomoki Toda, et al. One-to-Many and Many-to-One Voice Conversion Based on Eigenvoices, 2007, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[11] Daniel Erro, et al. Voice Conversion Based on Weighted Frequency Warping, 2010, IEEE Transactions on Audio, Speech, and Language Processing.

[12] Hui Ye, et al. High quality voice morphing, 2004, IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP).

[13] Zhizheng Wu, et al. A study of speaker adaptation for DNN-based speech synthesis, 2015, INTERSPEECH.

[14] Hui Liang, et al. Implementation of VTLN for statistical speech synthesis, 2010, SSW.

[15] Daniel Erro, et al. Weighted frequency warping for voice conversion, 2007, INTERSPEECH.

[16] Heiga Zen, et al. Statistical parametric speech synthesis using deep neural networks, 2013, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[17] Keiichi Tokuda, et al. Mel-generalized cepstral analysis - a unified approach to speech spectral estimation, 1994, ICSLP.

[18] Yannis Agiomyrgiannakis, et al. Vocaine the vocoder and applications in speech synthesis, 2015, IEEE International Conference on Acoustics and Signal Processing (ICASSP).

[19] Olivier Rosec, et al. Voice Conversion Using Dynamic Frequency Warping With Amplitude Scaling, for Parallel or Nonparallel Corpora, 2012, IEEE Transactions on Audio, Speech, and Language Processing.

[20] Eric Moulines, et al. Continuous probabilistic transform for voice conversion, 1998, IEEE Transactions on Speech and Audio Processing.

[21] Alexander Kain, et al. Spectral voice conversion for text-to-speech synthesis, 1998, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[22] Takao Kobayashi, et al. Analysis of Speaker Adaptation Algorithms for HMM-Based Speech Synthesis and a Constrained SMAPLR Adaptation Algorithm, 2009, IEEE Transactions on Audio, Speech, and Language Processing.

[23] S. Chiba, et al. Dynamic programming algorithm optimization for spoken word recognition, 1978.

[24] Inma Hernáez, et al. Improving the Quality of Standard GMM-Based Voice Conversion Systems by Considering Physically Motivated Linear Transformations, 2012, IberSPEECH.

[25] Tetsuya Takiguchi, et al. Voice conversion in high-order eigen space using deep belief nets, 2013, INTERSPEECH.

[26] Heiga Zen, et al. Unidirectional long short-term memory recurrent neural network with recurrent output layer for low-latency speech synthesis, 2015, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[27] Junichi Yamagishi, et al. CSTR VCTK Corpus: English Multi-speaker Corpus for CSTR Voice Cloning Toolkit, 2017.

[28] Sadaoki Furui, et al. Automatic speech summarization applied to English broadcast news speech, 2002, IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP).

[29] Eric Moulines, et al. Voice transformation using PSOLA technique, 1991, Speech Communication.

[30] Hui Liang, et al. VTLN adaptation for statistical speech synthesis, 2010, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).