Paper information - Low Bit-rate Speech Coding with VQ-VAE and a WaveNet Decoder

In order to efficiently transmit and store speech signals, speech codecs create a minimally redundant representation of the input signal which is then decoded at the receiver with the best possible perceptual quality. In this work we demonstrate that a neural network architecture based on VQ-VAE with a WaveNet decoder can be used to perform very low bit-rate speech coding with high reconstruction quality. A prosody-transparent and speaker-independent model trained on the LibriSpeech corpus coding audio at 1.6 kbps exhibits perceptual quality which is around halfway between the MELP codec at 2.4 kbps and AMR-WB codec at 23.05 kbps. In addition, when training on high-quality recorded speech with the test speaker included in the training set, a model coding speech at 1.6 kbps produces output of similar perceptual quality to that generated by AMR-WB at 23.05 kbps.
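To make the 1.6 kbps figure concrete, the sketch below shows how the bit rate of a vector-quantized code stream follows from the number of codes emitted per second and the codebook size. The specific values (200 codes per second, a 256-entry codebook) are illustrative assumptions chosen to land on 1.6 kbps; the abstract does not state the model's actual configuration, and the function name `vq_bitrate_bps` is hypothetical.

```python
import math


def vq_bitrate_bps(codes_per_second: float, codebook_size: int,
                   num_codebooks: int = 1) -> float:
    """Bit rate of a stream of discrete codebook indices.

    Each index costs log2(codebook_size) bits when sent with a
    fixed-length code and no entropy coding.
    """
    bits_per_code = math.log2(codebook_size)
    return codes_per_second * num_codebooks * bits_per_code


# Assumed, illustrative settings (not taken from the paper).
rate = vq_bitrate_bps(codes_per_second=200, codebook_size=256)

# Reference point: uncompressed 16 kHz, 16-bit mono PCM.
pcm_rate = 16_000 * 16

print(f"code stream: {rate / 1000:.1f} kbps")
print(f"compression vs. 16-bit PCM: {pcm_rate / rate:.0f}x")
```

Under these assumptions the code stream is two orders of magnitude smaller than the raw waveform, so the decoder has to regenerate most of the signal detail from the codes rather than read it from the bit stream.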
