Efficient and Scalable Neural Residual Waveform Coding with Collaborative Quantization

Scalability and efficiency are desirable in neural speech codecs, which must support a wide range of bitrates for applications on various devices. We propose a collaborative quantization (CQ) scheme to jointly learn the codebook of LPC coefficients and the corresponding residuals. CQ does not simply shoehorn LPC into a neural network; instead, it bridges the computational capacity of advanced neural network models and traditional, yet efficient and domain-specific, digital signal processing methods in an integrated manner. We demonstrate that CQ achieves much higher quality than its predecessor at 9 kbps with even lower model complexity. We also show that CQ can scale up to 24 kbps, where it outperforms AMR-WB and Opus. As a neural waveform codec, CQ models have fewer than 1 million parameters, significantly fewer than many other generative models.
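
For readers less familiar with the LPC side of the scheme, the split of a speech frame into LPC coefficients and a residual can be sketched as follows. This is only a minimal illustration of classical LPC analysis and synthesis (autocorrelation method with the Levinson-Durbin recursion), not the CQ method itself: it contains no quantization and no neural network, and the function names `lpc_coefficients`, `lpc_residual`, and `lpc_synthesize`, the prediction order, and the test frame are all assumptions made for this example.

```python
import numpy as np

def lpc_coefficients(frame, order):
    """Estimate A(z) = 1 + a1*z^-1 + ... + ap*z^-p (autocorrelation method)."""
    n = len(frame)
    # Autocorrelation at lags 0..order.
    r = np.array([np.dot(frame[: n - k], frame[k:]) for k in range(order + 1)])
    a = np.zeros(order + 1)
    a[0] = 1.0
    if r[0] == 0.0:
        return a  # silent frame: nothing to predict
    err = r[0]
    for i in range(1, order + 1):
        # Levinson-Durbin step: reflection coefficient for stage i.
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err
        prev = a.copy()
        for j in range(1, i):
            a[j] = prev[j] + k * prev[i - j]
        a[i] = k
        err *= 1.0 - k * k
    return a

def lpc_residual(frame, a):
    """Prediction residual e[n] = sum_j a[j] * x[n-j], with zero initial state."""
    return np.convolve(frame, a)[: len(frame)]

def lpc_synthesize(residual, a):
    """Invert lpc_residual: x[n] = e[n] - sum_{j>=1} a[j] * x[n-j]."""
    order = len(a) - 1
    out = np.zeros(len(residual))
    for n in range(len(residual)):
        pred = 0.0
        for j in range(1, min(order, n) + 1):
            pred += a[j] * out[n - j]
        out[n] = residual[n] - pred
    return out

# Usage on a synthetic frame (hypothetical data: a sinusoid plus light noise).
rng = np.random.default_rng(0)
t = np.arange(512)
frame = np.sin(2 * np.pi * 0.03 * t) + 0.01 * rng.standard_normal(512)
coeffs = lpc_coefficients(frame, order=8)
resid = lpc_residual(frame, coeffs)
print("frame energy:   ", float(np.sum(frame ** 2)))
print("residual energy:", float(np.sum(resid ** 2)))
print("round trip ok:  ", bool(np.allclose(lpc_synthesize(resid, coeffs), frame)))
```

In a codec, both outputs of this analysis (the coefficients and the residual) have to be quantized and transmitted, which is where the bit budget is divided between the two; the sketch above leaves both unquantized.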
