SUPERB: Speech processing Universal PERformance Benchmark

Self-supervised learning (SSL) has proven vital for advancing research in natural language processing (NLP) and computer vision (CV). The paradigm pretrains a shared model on large volumes of unlabeled data and achieves state-of-the-art (SOTA) for various tasks with minimal adaptation. However, the speech processing community lacks a similar setup to systematically explore the paradigm. To bridge this gap, we introduce Speech processing Universal PERformance Benchmark (SUPERB). SUPERB is a leaderboard to benchmark the performance of a shared model across a wide range of speech processing tasks with minimal architecture changes and labeled data. Among multiple usages of the shared model, we especially focus on extracting the representation learned from SSL for its preferable re-usability. We present a simple framework to solve SUPERB tasks by learning task-specialized lightweight prediction heads on top of the frozen shared model. Our results demonstrate that the framework is promising as SSL representations show competitive generalizability and accessibility across SUPERB tasks. We release SUPERB as a challenge with a leaderboard and a benchmark toolkit to fuel the research in representation learning and general speech processing.

[1]  Hung-yi Lee,et al.  Fragmentvc: Any-To-Any Voice Conversion by End-To-End Extracting and Fusing Fine-Grained Voice Fragments with Attention , 2020, ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[2]  Shang-Wen Li,et al.  TERA: Self-Supervised Learning of Transformer Encoder Representation for Speech , 2020, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[3]  Florian Metze,et al.  QUESST2014: Evaluating Query-by-Example Speech Search in a zero-resource setting with real-life queries , 2015, 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[4]  Hung-yi Lee,et al.  Mockingjay: Unsupervised Speech Representation Learning with Deep Bidirectional Transformer Encoders , 2020, ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[5]  Aren Jansen,et al.  Towards Learning a Universal Non-Semantic Representation of Speech , 2020, INTERSPEECH.

[6]  Francesco Caltagirone,et al.  Snips Voice Platform: an embedded Spoken Language Understanding system for private-by-design voice interfaces , 2018, ArXiv.

[7]  Abhinav Gupta,et al.  Scaling and Benchmarking Self-Supervised Visual Representation Learning , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[8]  Matthijs Douze,et al.  Deep Clustering for Unsupervised Learning of Visual Features , 2018, ECCV.

[9]  James R. Glass,et al.  Non-Autoregressive Predictive Coding for Learning Speech Representations from Local Dependencies , 2020, Interspeech.

[10]  Yoshua Bengio,et al.  Speech Model Pre-training for End-to-End Spoken Language Understanding , 2019, INTERSPEECH.

[11]  Ewan Dunbar,et al.  The Zero Resource Speech Benchmark 2021: Metrics and baselines for unsupervised spoken language modeling , 2020, ArXiv.

[12]  Jia Deng,et al.  How Useful Is Self-Supervised Pretraining for Visual Tasks? , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[13]  Yoshua Bengio,et al.  Multi-Task Self-Supervised Learning for Robust Speech Recognition , 2020, ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[14]  Hao Tang,et al.  An Unsupervised Autoregressive Model for Speech Representation Learning , 2019, INTERSPEECH.

[15]  Shang-Wen Li,et al.  Semi-Supervised Spoken Language Understanding via Self-Supervised Speech and Language Model Pretraining , 2020, ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[16]  Sanjeev Khudanpur,et al.  Librispeech: An ASR corpus based on public domain audio books , 2015, 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[17]  Sanjeev Khudanpur,et al.  X-Vectors: Robust DNN Embeddings for Speaker Recognition , 2018, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[18]  Luke S. Zettlemoyer,et al.  Deep Contextualized Word Representations , 2018, NAACL.

[19]  Abdel-rahman Mohamed,et al.  wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations , 2020, NeurIPS.

[20]  Ruslan Salakhutdinov,et al.  HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units , 2021, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[21]  Toni Giorgino,et al.  Computing and Visualizing Dynamic Time Warping Alignments in R: The dtw Package , 2009 .

[22]  Yannick Estève,et al.  Recent Advances in End-to-End Spoken Language Understanding , 2019, SLSP.

[23]  Carlos Busso,et al.  IEMOCAP: interactive emotional dyadic motion capture database , 2008, Lang. Resour. Evaluation.

[24]  James R. Glass,et al.  Vector-Quantized Autoregressive Predictive Coding , 2020, INTERSPEECH.

[25]  Antoine Deleforge,et al.  LibriMix: An Open-Source Dataset for Generalizable Speech Separation , 2020, 2005.11262.

[26]  Kenneth Heafield,et al.  KenLM: Faster and Smaller Language Model Queries , 2011, WMT@EMNLP.

[27]  Yuzong Liu,et al.  DeCoAR 2.0: Deep Contextualized Acoustic Representations with Vector Quantization , 2020, ArXiv.

[28]  Omer Levy,et al.  GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding , 2018, BlackboxNLP@EMNLP.

[29]  Alexei Baevski,et al.  vq-wav2vec: Self-Supervised Learning of Discrete Speech Representations , 2019, ICLR.

[30]  Ronan Collobert,et al.  wav2vec: Unsupervised Pre-training for Speech Recognition , 2019, INTERSPEECH.

[31]  Mireia Díez,et al.  GTTS-EHU Systems for QUESST at MediaEval 2014 , 2014, MediaEval.

[32]  Meng Li,et al.  Exploring wav2vec 2.0 on speaker verification and language identification , 2020, Interspeech.

[33]  Quoc V. Le,et al.  SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition , 2019, INTERSPEECH.

[34]  Oriol Vinyals,et al.  Representation Learning with Contrastive Predictive Coding , 2018, ArXiv.

[35]  Yoshua Bengio,et al.  Learning Problem-agnostic Speech Representations from Multiple Self-supervised Tasks , 2019, INTERSPEECH.

[36]  Antonio Bonafonte,et al.  Problem-Agnostic Speech Embeddings for Multi-Speaker Text-to-Speech with SampleRNN , 2019, ArXiv.

[37]  Ruslan Salakhutdinov,et al.  Hubert: How Much Can a Bad Teacher Benefit ASR Pre-Training? , 2021, ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[38]  Gabriel Synnaeve,et al.  Wav2Letter++: A Fast Open-source Speech Recognition System , 2018, ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[39]  Armand Joulin,et al.  Unsupervised Pretraining Transfers Well Across Languages , 2020, ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[40]  Naoyuki Kanda,et al.  End-to-End Neural Speaker Diarization with Permutation-Free Objectives , 2019, INTERSPEECH.

[41]  Ming-Wei Chang,et al.  BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding , 2019, NAACL.

[42]  Doug Downey,et al.  Don’t Stop Pretraining: Adapt Language Models to Domains and Tasks , 2020, ACL.

[43]  Joon Son Chung,et al.  Voxceleb: Large-scale speaker verification in the wild , 2020, Comput. Speech Lang..

[44]  WAV 2 VEC 2 . 0 ON SPEAKER VERIFICATION AND LANGUAGE IDENTIFICATION , 2021 .