Towards Achieving Robust Universal Neural Vocoding
Jaime Lorenzo-Trueba | Thomas Drugman | Javier Latorre | Thomas Merritt | Bartosz Putrycz | Roberto Barra-Chicote | Alexis Moinet | Vatsal Aggarwal
[1] Haizhou Li, et al. A Voice Conversion Framework with Tandem Feature Sparse Representation and Speaker-Adapted WaveNet Vocoder, 2018, INTERSPEECH.
[2] Gregory Diamos, et al. Fast Spectrogram Inversion Using Multi-Head Convolutional Neural Networks, 2018, IEEE Signal Processing Letters.
[3] Heiga Zen, et al. WaveNet: A Generative Model for Raw Audio, 2016, SSW.
[4] Adam Finkelstein, et al. FFTNet: A Real-Time Speaker-Dependent Neural Vocoder, 2018, ICASSP.
[5] Tomoki Toda, et al. Collapsed speech segment detection and suppression for WaveNet vocoder, 2018, INTERSPEECH.
[6] Paavo Alku, et al. Comparison of multiple voice source parameters in different phonation types, 2007, INTERSPEECH.
[7] Simon King, et al. The Blizzard Challenge 2008, 2008.
[8] Dong Yu, et al. Rapid Style Adaptation Using Residual Error Embedding for Expressive Speech Synthesis, 2018, INTERSPEECH.
[9] Cassia Valentini-Botinhao. Noisy reverberant speech database for training speech enhancement algorithms and TTS models, 2017.
[10] Thierry Dutoit, et al. The Deterministic Plus Stochastic Model of the Residual Signal and Its Applications, 2012, IEEE Transactions on Audio, Speech, and Language Processing.
[11] Antoine Liutkus, et al. The 2016 Signal Separation Evaluation Campaign, 2017, LVA/ICA.
[12] Masanori Morise, et al. WORLD: A Vocoder-Based High-Quality Speech Synthesis System for Real-Time Applications, 2016, IEICE Transactions on Information and Systems.
[13] Wei Ping, et al. ClariNet: Parallel Wave Generation in End-to-End Text-to-Speech, 2018, ICLR.
[14] Cassia Valentini-Botinhao. Reverberant speech database for training speech dereverberation algorithms and TTS models, 2016.
[15] Mark A. Clements, et al. Speech concatenation and synthesis using an overlap-add sinusoidal model, 1996, ICASSP.
[16] Hideki Kawahara, et al. Aperiodicity extraction and control using mixed mode excitation and group delay manipulation for a high quality speech analysis, modification and synthesis system STRAIGHT, 2001, MAVEBA.
[17] Tomoki Toda, et al. An investigation of multi-speaker training for WaveNet vocoder, 2017, ASRU.
[18] Jae Lim, et al. Signal estimation from modified short-time Fourier transform, 1984, IEEE Transactions on Acoustics, Speech, and Signal Processing.
[19] J. Liljencrants, et al. A four-parameter model of glottal flow, 1985, STL-QPSR, Dept. for Speech, Music and Hearing, KTH.
[20] Li-Rong Dai, et al. WaveNet Vocoder with Limited Training Data for Voice Conversion, 2018, INTERSPEECH.
[21] Simon King, et al. Investigating source and filter contributions, and their interaction, to statistical parametric speech synthesis, 2014, INTERSPEECH.
[22] Simon King, et al. Attributing modelling errors in HMM synthesis by stepping gradually from natural to modelled speech, 2015, ICASSP.
[23] Heiga Zen, et al. The HMM-based speech synthesis system (HTS) version 2.0, 2007, SSW.
[24] Ryan Prenger, et al. WaveGlow: A Flow-Based Generative Network for Speech Synthesis, 2019, ICASSP.
[25] Lauri Juvela, et al. A Comparison of Recent Waveform Generation and Acoustic Modeling Methods for Neural-Network-Based Speech Synthesis, 2018, ICASSP.
[26] Cassia Valentini-Botinhao, et al. Noisy speech database for training speech enhancement algorithms and TTS models, 2017.
[27] Heiga Zen, et al. Parallel WaveNet: Fast High-Fidelity Speech Synthesis, 2017, ICML.
[28] Patrick Nguyen, et al. Transfer Learning from Speaker Verification to Multispeaker Text-to-Speech Synthesis, 2018, NeurIPS.
[29] Navdeep Jaitly, et al. Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions, 2018, ICASSP.
[30] Adam Nadolski, et al. Comprehensive Evaluation of Statistical Speech Waveform Synthesis, 2018, IEEE SLT.
[31] S. Scott, et al. When voices get emotional: A corpus of nonverbal vocalizations for research on emotion processing, 2013, Behavior Research Methods.
[32] Colin Raffel, et al. librosa: Audio and Music Signal Analysis in Python, 2015, SciPy.
[33] Erich Elsen, et al. Efficient Neural Audio Synthesis, 2018, ICML.
[34] Laurent Besacier, et al. Collecting Resources in Sub-Saharan African Languages for Automatic Speech Recognition: A Case Study of Wolof, 2016, LREC.