AudioLM: A Language Modeling Approach to Audio Generation

We introduce AudioLM, a framework for high-quality audio generation with long-term consistency. AudioLM maps the input audio to a sequence of discrete tokens and casts audio generation as a language modeling task in this representation space. We show how existing audio tokenizers provide different trade-offs between reconstruction quality and long-term structure, and we propose a hybrid tokenization scheme to achieve both objectives. Namely, we leverage the discretized activations of a masked language model pre-trained on audio to capture long-term structure and the discrete codes produced by a neural audio codec to achieve high-quality synthesis. By training on large corpora of raw audio waveforms, AudioLM learns to generate natural and coherent continuations given short prompts. When trained on speech, and without any transcript or annotation, AudioLM generates syntactically and semantically plausible speech continuations while also maintaining speaker identity and prosody for unseen speakers. Furthermore, we demonstrate how our approach extends beyond speech by generating coherent piano music continuations, despite being trained without any symbolic representation of music.
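To make the hybrid tokenization concrete, the sketch below walks through the staged generation pipeline the abstract describes: tokenize the prompt into a coarse "semantic" stream and a fine "acoustic" stream, extend the semantic stream autoregressively, then generate acoustic tokens conditioned on it. This is a toy sketch under loud assumptions, not the paper's implementation: the function names (semantic_tokens, acoustic_tokens, sample_next, continue_audio), token rates, and vocabulary sizes are all hypothetical, and random draws stand in for the real components (discretized activations of a masked language model such as w2v-BERT for the semantic stream, codes from a neural codec such as SoundStream for the acoustic stream, and Transformer language models for the sampling steps).

import numpy as np

rng = np.random.default_rng(0)

SEMANTIC_VOCAB, ACOUSTIC_VOCAB = 512, 1024   # illustrative codebook sizes
SEMANTIC_RATE, ACOUSTIC_RATE = 25, 50        # illustrative tokens per second

def semantic_tokens(wav, sr=16000):
    # Stand-in for k-means-discretized masked-LM activations: a low-rate
    # stream carrying long-term structure (here, just random tokens).
    n = int(len(wav) / sr * SEMANTIC_RATE)
    return list(rng.integers(0, SEMANTIC_VOCAB, n))

def acoustic_tokens(wav, sr=16000):
    # Stand-in for neural-codec codes: a higher-rate stream carrying
    # speaker identity and fine acoustic detail.
    n = int(len(wav) / sr * ACOUSTIC_RATE)
    return list(rng.integers(0, ACOUSTIC_VOCAB, n))

def sample_next(prefix, vocab):
    # Stand-in for one autoregressive Transformer sampling step; a real
    # model would condition on `prefix` instead of drawing uniformly.
    return int(rng.integers(0, vocab))

def continue_audio(prompt_wav, seconds=3, sr=16000):
    # Stage 1: extend the semantic stream from the prompt, fixing the
    # long-term structure of the continuation.
    sem = semantic_tokens(prompt_wav, sr)
    for _ in range(seconds * SEMANTIC_RATE):
        sem.append(sample_next(sem, SEMANTIC_VOCAB))
    # Stage 2: generate acoustic tokens conditioned on the full semantic
    # stream and the prompt's acoustic tokens, filling in fine detail.
    ac = acoustic_tokens(prompt_wav, sr)
    for _ in range(seconds * ACOUSTIC_RATE):
        ac.append(sample_next(sem + ac, ACOUSTIC_VOCAB))
    # The real system would feed `ac` to the codec decoder for a waveform.
    return sem, ac

prompt = rng.normal(size=16000)              # one second of toy "audio"
sem, ac = continue_audio(prompt)
print(len(sem), len(ac))                     # 100 semantic, 200 acoustic tokens

The split is the point of the hybrid scheme: sampling the low-rate semantic stream first lets a language model commit to coherent long-term structure cheaply, and the acoustic stage then only has to synthesize fidelity consistent with that plan.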
