Paper information - Efficient and Scalable Neural Residual Waveform Coding with Collaborative Quantization

Scalability and efficiency are desirable in neural speech codecs, which need to support a wide range of bitrates for applications on various devices. We propose a collaborative quantization (CQ) scheme to jointly learn the codebook of LPC coefficients and the corresponding residuals. CQ does not simply shoehorn LPC into a neural network; it bridges the computational capacity of advanced neural network models and traditional, yet efficient and domain-specific, digital signal processing methods in an integrated manner. We demonstrate that CQ achieves much higher quality than its predecessor at 9 kbps with even lower model complexity. We also show that CQ can scale up to 24 kbps, where it outperforms AMR-WB and Opus. As a neural waveform codec, CQ models have fewer than 1 million parameters, significantly fewer than many other generative models.
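
To make the LPC-plus-residual split concrete, the following is a minimal sketch of the classical pipeline that the abstract builds on: estimate LPC coefficients for a frame, filter the frame to obtain the residual, quantize the residual, and resynthesize. It is not the CQ method. The abstract describes jointly learned quantization of both the LPC coefficients and the residuals, whereas this sketch assumes a fixed uniform scalar quantizer applied to the residual only. All function names, the frame length, the LPC order, the step size, and the synthetic test signal are hypothetical choices made for illustration.

```python
import numpy as np


def lpc_coefficients(frame, order):
    """Estimate A(z) = 1 + a[1] z^-1 + ... (autocorrelation method, Levinson-Durbin)."""
    n = len(frame)
    r = np.array([np.dot(frame[: n - k], frame[k:]) for k in range(order + 1)])
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0] + 1e-9  # small floor avoids division by zero on silent frames
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err  # reflection coefficient
        prev = a.copy()
        a[1:i] = prev[1:i] + k * prev[i - 1:0:-1]
        a[i] = k
        err *= 1.0 - k * k
    return a


def lpc_residual(frame, a):
    """Analysis filter: e[n] = sum_j a[j] * x[n-j], with zero initial state."""
    return np.convolve(frame, a)[: len(frame)]


def lpc_synthesize(residual, a):
    """Synthesis filter 1/A(z), the inverse of lpc_residual."""
    order = len(a) - 1
    out = np.zeros(len(residual))
    for n in range(len(residual)):
        past = sum(a[j] * out[n - j] for j in range(1, min(order, n) + 1))
        out[n] = residual[n] - past
    return out


def quantize_uniform(values, step):
    """Fixed uniform scalar quantizer (an assumption, not the learned quantizer)."""
    codes = np.round(values / step).astype(int)
    return codes, codes * step


# Synthetic, predictable test signal: a stable second-order autoregressive process.
rng = np.random.default_rng(0)
noise = rng.standard_normal(320)
x = np.zeros(320)
for n in range(2, 320):
    x[n] = 1.3 * x[n - 1] - 0.6 * x[n - 2] + noise[n]

a = lpc_coefficients(x, order=4)
res = lpc_residual(x, a)
codes, res_hat = quantize_uniform(res, step=0.05)
x_hat = lpc_synthesize(res_hat, a)

print("prediction gain:", np.sum(x ** 2) / np.sum(res ** 2))
print("max reconstruction error:", np.max(np.abs(x_hat - x)))
```

The residual carries less energy than the input frame, which is why it is cheaper to code. In this open-loop sketch the quantization noise passes through the synthesis filter, so the reconstruction error can exceed half the quantizer step.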
