LMCodec: A Low Bitrate Speech Codec with Causal Transformer Models

We introduce LMCodec, a causal neural speech codec that provides high quality audio at very low bitrates. The backbone of the system is a causal convolutional codec that encodes audio into a hierarchy of coarse-to-fine tokens using residual vector quantization. LMCodec trains a Transformer language model to predict the fine tokens from the coarse ones in a generative fashion, allowing for the transmission of fewer codes. A second Transformer predicts the uncertainty of the next codes given the past transmitted codes, and is used to perform conditional entropy coding. A MUSHRA subjective test was conducted and shows that the quality is comparable to reference codecs at higher bitrates. Example audio is available at https://mjenrungrot.github.io/chrome-media-audio-papers/publications/lmcodec.

[1]  David Grangier,et al.  AudioLM: A Language Modeling Approach to Audio Generation , 2022, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[2]  W. Kleijn,et al.  Ultra-Low-Bitrate Speech Coding with Pretrained Transformers , 2022, INTERSPEECH.

[3]  J. Yamagishi,et al.  Generalization Ability of MOS Prediction Networks , 2021, ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[4]  Marco Tagliasacchi,et al.  SoundStream: An End-to-End Neural Audio Codec , 2022, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[5]  Minje Kim,et al.  Harp-Net: Hyper-Autoencoded Reconstruction Propagation for Scalable Neural Audio Coding , 2021, 2021 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA).

[6]  Eugene Kharitonov,et al.  Speech Resynthesis from Discrete Disentangled Self-Supervised Representations , 2021, Interspeech.

[7]  Andrew Hines,et al.  Warp-Q: Quality Prediction for Generative Neural Speech Codecs , 2021, ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[8]  Yu Zhang,et al.  Conformer: Convolution-augmented Transformer for Speech Recognition , 2020, INTERSPEECH.

[9]  Andrew Hines,et al.  ViSQOL v3: An Open Source Production Ready Objective Speech and Audio Metric , 2020, 2020 Twelfth International Conference on Quality of Multimedia Experience (QoMEX).

[10]  Abdel-rahman Mohamed,et al.  Libri-Light: A Benchmark for ASR with Limited or No Supervision , 2019, ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[11]  Junichi Yamagishi,et al.  CSTR VCTK Corpus: English Multi-speaker Corpus for CSTR Voice Cloning Toolkit (version 0.92) , 2019 .

[12]  Minje Kim,et al.  Cascaded Cross-Module Residual Learning towards Lightweight End-to-End Speech Coding , 2019, INTERSPEECH.

[13]  Thomas C. Walters,et al.  Low Bit-rate Speech Coding with VQ-VAE and a WaveNet Decoder , 2019, ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[14]  Jan Skoglund,et al.  LPCNET: Improving Neural Speech Synthesis through Linear Prediction , 2018, ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[15]  Noam Shazeer,et al.  Adafactor: Adaptive Learning Rates with Sublinear Memory Cost , 2018, ICML.

[16]  Srihari Kankanahalli,et al.  End-To-End Optimized Speech Coding with Deep Neural Networks , 2017, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[17]  Jean-Marc Valin,et al.  A Hybrid DSP/Deep Learning Approach to Real-Time Full-Band Speech Enhancement , 2017, 2018 IEEE 20th International Workshop on Multimedia Signal Processing (MMSP).

[18]  Lukasz Kaiser,et al.  Attention is All you Need , 2017, NIPS.

[19]  Zhe Wang,et al.  Overview of the EVS codec architecture , 2015, 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[20]  Sanjeev Khudanpur,et al.  Librispeech: An ASR corpus based on public domain audio books , 2015, 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[21]  Jodi Kearns,et al.  LibriVox: Free Public Domain Audiobooks , 2014 .

[22]  Timothy B. Terriberry,et al.  Definition of the Opus Audio Codec , 2012, RFC.

[23]  Shigeo Morishima,et al.  Speech coding based on a multi-layer neural network , 1990, IEEE International Conference on Communications, Including Supercomm Technical Sessions.