VQVAE Unsupervised Unit Discovery and Multi-scale Code2Spec Inverter for Zerospeech Challenge 2019

We describe our submitted system for the ZeroSpeech Challenge 2019. This year's challenge addresses the difficulty of constructing a speech synthesizer without any text or phonetic labels: the system must (1) discover subword units in an unsupervised way and (2) resynthesize the speech in a target speaker's voice. Moreover, the system should balance the ABX discrimination score, the bitrate of the discovered representation, and the naturalness and intelligibility of the synthesized voice. To tackle these problems and achieve the best trade-off, we utilize a vector quantized variational autoencoder (VQ-VAE) and a multi-scale codebook-to-spectrogram (Code2Spec) inverter trained with mean squared error and adversarial losses. The VQ-VAE encodes the speech into a latent space and maps each latent vector to its nearest codebook entry, producing a compressed discrete representation. Next, the inverter generates a magnitude spectrogram in the target speaker's voice, given the codebook vectors from the VQ-VAE. In our experiments, we also investigated several other clustering algorithms, including K-means and GMM, and compared their ABX scores and bitrates with those of the VQ-VAE. Our proposed approach significantly improved the intelligibility (in CER), the MOS, and the ABX discrimination score compared to the official ZeroSpeech 2019 baseline, and even the topline.
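
To make the discretization step concrete, below is a minimal sketch of the VQ-VAE quantization layer in PyTorch. This is an illustration of the general technique (van den Oord et al.'s nearest-codebook lookup with a commitment loss and straight-through gradients), not the authors' exact implementation; the codebook size, code dimension, and commitment weight are illustrative guesses.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VectorQuantizer(nn.Module):
    """Minimal VQ layer: maps each encoder output frame to its nearest
    codebook entry. Hyperparameters here are illustrative, not the
    submitted system's configuration."""

    def __init__(self, num_codes=256, code_dim=64, beta=0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, code_dim)
        self.codebook.weight.data.uniform_(-1.0 / num_codes, 1.0 / num_codes)
        self.beta = beta  # commitment-loss weight (assumed value)

    def forward(self, z_e):
        # z_e: (batch, time, code_dim) continuous encoder outputs.
        flat = z_e.reshape(-1, z_e.size(-1))                    # (B*T, D)
        # Squared Euclidean distance from each frame to every codebook vector.
        d = (flat.pow(2).sum(1, keepdim=True)
             - 2 * flat @ self.codebook.weight.t()
             + self.codebook.weight.pow(2).sum(1))              # (B*T, K)
        idx = d.argmin(dim=1)                                   # discrete unit IDs
        z_q = self.codebook(idx).view_as(z_e)                   # quantized vectors

        # Codebook loss pulls codes toward encoder outputs; commitment loss
        # keeps the encoder close to its assigned codes.
        loss = (F.mse_loss(z_q, z_e.detach())
                + self.beta * F.mse_loss(z_e, z_q.detach()))
        # Straight-through estimator: copy gradients from z_q to the encoder.
        z_q = z_e + (z_q - z_e).detach()
        return z_q, idx.view(z_e.shape[:-1]), loss
```

The discrete unit IDs give the low-bitrate representation, while the quantized vectors feed the inverter. For the Code2Spec training objective, the abstract only states that MSE and adversarial losses are combined; a plausible sketch of the generator-side loss follows, where the least-squares adversarial form and the weight `w_adv` are assumptions:

```python
def inverter_loss(pred_spec, target_spec, disc_fake, w_adv=0.01):
    # Spectrogram reconstruction term (MSE on magnitudes).
    mse = F.mse_loss(pred_spec, target_spec)
    # Least-squares adversarial term: push discriminator scores on the
    # generated spectrogram toward 1 ("real"); LSGAN form is an assumption.
    adv = F.mse_loss(disc_fake, torch.ones_like(disc_fake))
    return mse + w_adv * adv
```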
