Textually Pretrained Speech Language Models

Speech language models (SpeechLMs) process and generate acoustic data only, without textual supervision. In this work, we propose TWIST, a method for training SpeechLMs using a warm start from pretrained textual language models. We show, using both automatic and human evaluations, that TWIST outperforms a cold-start SpeechLM across the board. We empirically analyze the effect of different model design choices, such as the speech tokenizer, the pretrained textual model, and the dataset size. We find that both model and dataset scale play an important role in constructing better-performing SpeechLMs. Based on these observations, we present the largest (to the best of our knowledge) SpeechLM in terms of both number of parameters and training data. We additionally introduce two spoken versions of the StoryCloze textual benchmark to further improve model evaluation and advance future research in the field. Speech samples are available on our website: https://pages.cs.huji.ac.il/adiyoss-lab/twist/
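
The warm-start idea can be illustrated with a minimal sketch: take a pretrained causal text LM, swap its text vocabulary for a discrete speech-unit vocabulary, and continue training with standard next-token prediction over unit sequences. The sketch below uses Hugging Face Transformers; the model name, unit-vocabulary size, and placeholder unit sequence are illustrative assumptions, not the paper's exact configuration.

```python
import torch
from transformers import AutoModelForCausalLM

# Assumed size of the discrete speech-unit vocabulary; the actual number
# depends on the chosen speech tokenizer.
N_SPEECH_UNITS = 500

# Start from any pretrained causal text LM (model name is illustrative).
model = AutoModelForCausalLM.from_pretrained("facebook/opt-350m")

# Swap the text vocabulary for the speech-unit vocabulary while keeping the
# pretrained transformer body -- this is the "warm start".
model.resize_token_embeddings(N_SPEECH_UNITS)

# Text-token embeddings carry no meaning for speech units, so re-initialize
# the (resized) embedding table before training.
model.get_input_embeddings().weight.data.normal_(mean=0.0, std=0.02)

# Training then proceeds as ordinary next-token prediction over unit streams.
units = torch.randint(0, N_SPEECH_UNITS, (1, 128))  # placeholder unit sequence
loss = model(input_ids=units, labels=units).loss
loss.backward()
```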
