Improving Unsupervised Style Transfer in End-to-End Speech Synthesis with End-to-End Speech Recognition

An end-to-end TTS model can directly take an utterance as reference and generate speech from text with prosody and speaker characteristics similar to those of the reference utterance. Ideally, the transcription of the reference utterance need not match the text to be synthesized, so unsupervised style transfer can be achieved. However, because previous models are trained only on matched text-speech pairs, giving them unmatched text and speech at test time makes them synthesize blurry speech. In this paper, we propose to mitigate this problem by also using unmatched text and speech during training, with the recognition accuracy of an end-to-end ASR model guiding the training procedure. The experimental results show that with the guidance of end-to-end ASR, both the ASR accuracy (objective evaluation) and the listener preference (subjective evaluation) of the speech generated by the TTS model improve. Moreover, we propose an attention consistency loss as a regularizer, which is shown to accelerate training.
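The abstract gives no implementation details, but the training signal it describes can be sketched in code. The PyTorch sketch below shows one plausible shape of a training step on an unmatched pair: speech is synthesized from text plus an unmatched style reference, and a frozen end-to-end ASR model scores the result against the input text, so gradients reach the TTS model only through the synthesized spectrogram. All names here (`tts_model`, `asr_model`, the `score` method, the attention shapes, and the loss weights) are illustrative assumptions, and the attention consistency term is only one possible reading of the abstract, not the paper's confirmed formulation.

```python
import torch
import torch.nn.functional as F

def asr_guided_step(tts_model, asr_model, text_ids, ref_audio,
                    asr_weight=1.0, attn_weight=0.1):
    """One (hypothetical) training step on an UNMATCHED pair.

    `ref_audio` supplies prosody and speaker characteristics; `text_ids`
    is the content to synthesize. No ground-truth spectrogram exists for
    this pair, so the only supervision is whether a pre-trained ASR model
    can recover `text_ids` from the synthesized speech.
    """
    # Synthesize a mel spectrogram conditioned on the unmatched reference.
    # Assumed to also return the decoder's frame-to-text attention,
    # shaped (frames, chars).
    mel, tts_attn = tts_model(text_ids, reference=ref_audio)

    # The end-to-end ASR model is frozen: it only scores the synthesized
    # speech, so gradients flow into the TTS model through `mel`.
    # Assumed to return the per-character cross-entropy on `text_ids` and
    # the ASR's char-to-frame attention, shaped (chars, frames).
    for p in asr_model.parameters():
        p.requires_grad_(False)
    asr_loss, asr_attn = asr_model.score(mel, targets=text_ids)

    # Attention consistency (a guess at the regularizer's shape): the TTS
    # frame-to-text alignment and the ASR char-to-frame alignment describe
    # the same correspondence, so penalize their disagreement. Assumes
    # both attentions are computed at the same time resolution.
    attn_loss = F.l1_loss(tts_attn.transpose(0, 1), asr_attn)

    loss = asr_weight * asr_loss + attn_weight * attn_loss
    loss.backward()
    return loss
```

In practice a step like this would presumably be interleaved with the usual supervised TTS step on matched text-speech pairs, with the ASR and attention-consistency weights tuned so the recognition signal guides rather than dominates training.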
