FastS2S-VC: Streaming Non-Autoregressive Sequence-to-Sequence Voice Conversion

This paper proposes a non-autoregressive extension of our previously proposed sequence-to-sequence (S2S) model-based voice conversion (VC) methods. S2S model-based VC methods have attracted particular attention in recent years for their flexibility in converting not only the voice identity but also the pitch contour and local duration of input speech, thanks to the encoder-decoder architecture with an attention mechanism. However, one obstacle to making these methods work in real time is their autoregressive (AR) structure. To overcome this obstacle, we develop a method, based on a teacher-student learning framework, for obtaining a model that is free of the AR structure yet behaves similarly to the original S2S models. In our method, called “FastS2S-VC”, the student model consists of an encoder, a decoder, and an attention predictor. The attention predictor learns to predict attention distributions from source speech and a target class index alone, guided by the distributions predicted by the teacher model from both source and target speech. Specifically, it is designed as a network that takes the mel-spectrogram of the source speech as input and generates, in parallel, the rows of an attention weight matrix, each represented by a Gaussian function. Since the model does not take the mel-spectrogram of the target speech as input, it is free of the AR structure and allows for parallelization. Furthermore, we show that FastS2S-VC is suitable for real-time implementation based on a sliding-window approach, and we describe how to make it run in real time. Through speaker-identity and emotional-expression conversion experiments, we confirmed that FastS2S-VC sped up the conversion process by a factor of 70 to 100 compared with the original AR-type S2S-VC methods, without significantly degrading audio quality or similarity to the target speech. We also confirmed that the real-time version of FastS2S-VC runs with a latency of 32 ms on a GPU.
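To picture the parallel Gaussian attention described above, here is a minimal NumPy sketch. It assumes the attention predictor has already produced a center position and spread for every output frame; the function name, array shapes, and exact parameterization are hypothetical illustrations, not the paper's formulation.

```python
import numpy as np

def gaussian_attention(centers, widths, n_src):
    """Build an attention weight matrix whose t-th row is a Gaussian over
    source frames, centered at centers[t] with spread widths[t].
    All rows are computed at once, with no autoregressive loop."""
    src_pos = np.arange(n_src)                        # (n_src,)
    diff = src_pos[None, :] - centers[:, None]        # (n_tgt, n_src)
    weights = np.exp(-0.5 * (diff / widths[:, None]) ** 2)
    return weights / weights.sum(axis=1, keepdims=True)  # row-normalize

# Toy usage: 4 target frames attending over 10 source frames.
centers = np.array([1.0, 3.5, 6.0, 8.5])   # hypothetical predicted peaks
widths = np.full(4, 1.2)                   # hypothetical predicted spreads
A = gaussian_attention(centers, widths, n_src=10)     # (4, 10)
enc = np.random.randn(10, 80)              # stand-in for encoder outputs
context = A @ enc                          # (4, 80) context vectors
```

Because every row depends only on its own predicted center and width, the whole matrix can be evaluated in one shot, which is what removes the frame-by-frame recursion of the AR decoder.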

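The sliding-window idea can likewise be pictured with the following toy driver. This is only a sketch under assumptions not stated in the source: convert_fn stands in for the full conversion pipeline, the window and hop sizes are made up, and frame-synchronous output is assumed for simplicity.

```python
import numpy as np

def stream_convert(frames, convert_fn, win=64, hop=16):
    """Toy sliding-window driver: take in `hop` new mel frames at a time,
    run conversion on the most recent `win`-frame window, and emit only
    the `hop` newest converted frames, so latency is bounded by the hop."""
    buf = np.zeros((0, frames.shape[1]))
    outputs = []
    for start in range(0, frames.shape[0] - hop + 1, hop):
        buf = np.vstack([buf, frames[start:start + hop]])[-win:]
        converted = convert_fn(buf)       # parallel, non-AR conversion
        outputs.append(converted[-hop:])  # emit only the newest frames
    return np.vstack(outputs)

# Toy usage with an identity "converter" on random mel frames.
mel = np.random.randn(160, 80)
out = stream_convert(mel, convert_fn=lambda x: x)
assert out.shape == mel.shape
```

The key point the sketch illustrates is that a non-AR model can reconvert a short trailing window on every hop, so per-hop latency stays fixed regardless of how long the utterance grows.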