Unsupervised Speaker Adaptation for DNN-based Speech Synthesis using Input Codes

A new speaker-adaptation technique for deep neural network (DNN)-based speech synthesis, which requires only speech data without orthographic transcriptions, is proposed. The technique builds on a DNN-based speech-synthesis model that takes speaker, gender, and age as additional inputs and outputs the acoustic parameters of the corresponding voice from text, enabling both multi-speaker modeling and speaker adaptation. It uses a new input code that represents acoustic similarity to each of the training speakers as a probability. This input code, called a "speaker-similarity vector," is obtained by concatenating posterior probabilities calculated from a model of each training speaker. GMM-UBM or i-vector/PLDA models, which are widely used in text-independent speaker verification, represent the speakers, since they require no text information. Text and the speaker-similarity vectors of the training speakers are first used as input to train a multi-speaker speech-synthesis model that outputs the acoustic parameters of the training speakers. A speaker-similarity vector for an unknown target speaker is then estimated from a small amount of that speaker's speech on the basis of the separately trained speaker models. Inputting the estimated speaker-similarity vector into the multi-speaker speech-synthesis model is expected to generate synthetic speech that resembles the target speaker's voice. In objective and subjective experiments, the adaptation performance of the proposed technique was evaluated using not only studio-quality adaptation data but also low-quality (i.e., noisy and reverberant) data. The results indicate that the proposed technique makes it possible to rapidly construct a voice for a target speaker in DNN-based speech synthesis.
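As a rough illustration of how such a speaker-similarity vector could be computed in the GMM-UBM setting, the following is a minimal Python sketch, not the authors' implementation. It assumes one fitted scikit-learn GaussianMixture per training speaker (in a true GMM-UBM system each would be MAP-adapted from a universal background model), and the function name, arguments, and the feature-extraction step are hypothetical.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def speaker_similarity_vector(features, speaker_gmms, priors=None):
    """Posterior probability of each training speaker given an utterance.

    features:     (n_frames, n_dims) acoustic features (e.g., MFCCs)
                  extracted from untranscribed speech
    speaker_gmms: list of fitted GaussianMixture models, one per
                  training speaker
    priors:       optional array of speaker prior probabilities
    Returns a vector with one entry per training speaker, summing to 1.
    """
    # Average per-frame log-likelihood under each speaker's model.
    log_likes = np.array([gmm.score(features) for gmm in speaker_gmms])
    if priors is not None:
        log_likes += np.log(priors)
    # Softmax over speakers converts log-likelihoods into posteriors.
    log_likes -= log_likes.max()  # numerical stability
    post = np.exp(log_likes)
    return post / post.sum()
```

At adaptation time, the same computation would be applied to the target speaker's untranscribed speech, and the resulting vector fed to the trained multi-speaker synthesis model in place of a training speaker's code; with i-vector/PLDA, the per-speaker scores would come from PLDA log-likelihood ratios instead of GMM likelihoods.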
