Paper Information - High-quality Speech Coding with Sample RNN

High-quality Speech Coding with Sample RNN

We provide a speech coding scheme employing a generative model based on SampleRNN that, while operating at significantly lower bitrates, matches or surpasses the perceptual quality of state-of-the-art classic wide-band codecs. Moreover, it is demonstrated that the proposed scheme can provide a meaningful rate-distortion trade-off without retraining. We evaluate the proposed scheme in a series of listening tests and discuss limitations of the approach.
