Voice Conversion for TTS Systems with Tuning on the Target Speaker Based on GMM

This paper aims to improve voice conversion (VC) methods for text-to-speech (TTS) synthesis systems that can be tuned to a target speaker. It presents such a system, with a VC module in the acoustic processor and a parametric representation of the speech database for concatenative synthesis based on instantaneous harmonic representation. Voice conversion is based on a multiple regression mapping function and a Gaussian mixture model (GMM); text-independent training is based on hidden Markov models and a modified Viterbi algorithm. An experimental evaluation of the proposed solutions in terms of naturalness and similarity is also presented.
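As an illustration of what a GMM-based regression mapping can look like, the sketch below implements the commonly used form F(x) = sum_i p_i(x) (mu_y_i + S_yx_i S_xx_i^-1 (x - mu_x_i)), where p_i(x) is the posterior probability of mixture component i given the source frame x. This is an assumption about the general family of methods the abstract names, not the authors' exact formulation; the function and parameter names (convert_frame, weights, mu_x, mu_y, cov_xx, cov_yx) are hypothetical, and the parameters are assumed to have been trained already.

```python
import numpy as np

def gaussian_pdf(x, mean, cov):
    """Multivariate normal density evaluated at x."""
    d = x.shape[0]
    diff = x - mean
    norm = np.sqrt(((2.0 * np.pi) ** d) * np.linalg.det(cov))
    return float(np.exp(-0.5 * diff @ np.linalg.solve(cov, diff)) / norm)

def convert_frame(x, weights, mu_x, mu_y, cov_xx, cov_yx):
    """Map one source feature vector x to the target speaker's space.

    Hypothetical parameters (assumed pre-trained, one entry per component):
      weights: mixture weights
      mu_x, mu_y: source and target mean vectors
      cov_xx: source covariance matrices
      cov_yx: target-source cross-covariance matrices
    """
    # Posterior probability of each mixture component given x.
    likelihoods = np.array([
        w * gaussian_pdf(x, m, c) for w, m, c in zip(weights, mu_x, cov_xx)
    ])
    posteriors = likelihoods / likelihoods.sum()

    # Posterior-weighted sum of per-component linear regressions.
    y = np.zeros_like(mu_y[0], dtype=float)
    for p, mx, my, cxx, cyx in zip(posteriors, mu_x, mu_y, cov_xx, cov_yx):
        y += p * (my + cyx @ np.linalg.solve(cxx, x - mx))
    return y

# Toy usage with a single component: the mapping reduces to y = 1 + 2 * x.
x = np.array([1.0, 2.0])
y = convert_frame(
    x,
    weights=[1.0],
    mu_x=[np.zeros(2)],
    mu_y=[np.ones(2)],
    cov_xx=[np.eye(2)],
    cov_yx=[2.0 * np.eye(2)],
)
```

With a single component the posterior is 1 and the function is an ordinary linear regression; with several components the posteriors blend the local regressions smoothly. A practical version would likely compute likelihoods in the log domain to avoid underflow on high-dimensional features.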
