Style Equalization: Unsupervised Learning of Controllable Generative Sequence Models

Controllable generative sequence models with the capability to extract and replicate the style of specific examples enable many applications, including narrating audiobooks in different voices, auto-completing and auto-correcting handwritten text, and generating missing training samples for downstream recognition tasks. However, typical training algorithms for these controllable sequence generative models suffer from a training-inference mismatch: the same sample is used as both content and style input during training, but different samples are given during inference. In this paper, we tackle the training-inference mismatch encountered during unsupervised learning of controllable generative sequence models. By introducing a style transformation module that we call style equalization, we enable training with different content and style samples and thereby mitigate the training-inference mismatch. To demonstrate its generality, we apply style equalization to text-to-speech and text-to-handwriting synthesis on three datasets. Our models achieve state-of-the-art style replication, with a mean style opinion score similar to that of real data. Moreover, the proposed method enables style interpolation between sequences and generates novel styles.
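The mismatch described above can be made concrete with a minimal sketch. The classes and functions below are illustrative assumptions, not the paper's actual implementation: they only contrast how a naive training step pairs content and style from the same sample, while a style-equalized step draws them from different samples, matching the inference setting.

```python
from dataclasses import dataclass

@dataclass
class Sample:
    text: str        # content input (e.g., a transcript)
    sequence: list   # style-bearing signal (e.g., spectrogram frames or pen strokes)

class ToyControllableModel:
    """Stand-in for a controllable generative sequence model (hypothetical)."""

    def style_equalize(self, style_seq, target):
        # Placeholder for the style transformation module: in the paper, this
        # module accounts for the style difference between the style input and
        # the target sample, so a different sample can safely supply the style.
        return style_seq

    def loss(self, content, style, target):
        # Dummy reconstruction loss; a real model would decode `content` in the
        # given `style` and compare the output against `target`.
        return float(len(target))

def naive_training_pair(dataset, i):
    # Training-inference mismatch: content and style come from the SAME sample,
    # so the style encoder can leak content information.
    s = dataset[i]
    return s, s

def equalized_training_pair(dataset, i):
    # With style equalization, the style input is a DIFFERENT sample, matching
    # the inference setting where content and style examples are unrelated.
    s = dataset[i]
    other = dataset[(i + 1) % len(dataset)]
    return s, other
```

At inference time, a user always supplies a style example different from the content to be synthesized; the equalized pairing makes training see the same condition.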
