MT4SSL: Boosting Self-Supervised Speech Representation Learning by Integrating Multiple Targets

In this paper, we provide a new perspective on self-supervised speech models based on how the training targets are obtained. We generalize the targets extractor into an Offline Targets Extractor (Off-TE) and an Online Targets Extractor (On-TE). Based on this, we propose a new multi-task learning framework for self-supervised learning, MT4SSL, which boosts self-supervised speech representation learning by integrating multiple targets. MT4SSL draws on two representative models, HuBERT and data2vec, which use the K-means algorithm as an Off-TE and a gradient-free teacher network as an On-TE, respectively. Our model outperforms previous SSL methods by nontrivial margins on the LibriSpeech benchmark, and is comparable to or even better than the best-performing models while requiring less data. Furthermore, we find that using both Off-TE and On-TE leads to better convergence in the pre-training phase. Given both its effectiveness and efficiency, we believe that multi-task learning on self-supervised speech models from this perspective is a promising direction.
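To make the multi-target idea concrete, the sketch below combines a HuBERT-style masked-prediction loss over offline K-means cluster IDs (Off-TE) with a data2vec-style regression loss against a gradient-free EMA teacher (On-TE). This is a minimal illustration under assumed names, dimensions, and loss weights; it is not the authors' released implementation, and details such as which teacher layers are averaged for the online targets are simplified.

```python
# Minimal sketch: one student encoder trained against two kinds of targets.
# All hyperparameters and module names are illustrative assumptions.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiTargetSSL(nn.Module):
    def __init__(self, dim=768, num_clusters=500, num_layers=12, ema_decay=0.999):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=12, dim_feedforward=3072, batch_first=True
        )
        self.student = nn.TransformerEncoder(layer, num_layers=num_layers)
        # Offline-target head: predicts K-means cluster IDs (Off-TE, HuBERT-style).
        self.cluster_head = nn.Linear(dim, num_clusters)
        # Online-target head: regresses teacher representations (On-TE, data2vec-style).
        self.regress_head = nn.Linear(dim, dim)
        # Gradient-free teacher, updated as an exponential moving average of the student.
        self.teacher = copy.deepcopy(self.student)
        for p in self.teacher.parameters():
            p.requires_grad = False
        self.ema_decay = ema_decay
        self.mask_emb = nn.Parameter(torch.zeros(dim))

    @torch.no_grad()
    def update_teacher(self):
        # Call after each optimizer step.
        for ps, pt in zip(self.student.parameters(), self.teacher.parameters()):
            pt.mul_(self.ema_decay).add_(ps, alpha=1.0 - self.ema_decay)

    def forward(self, feats, cluster_ids, mask, lambda_online=1.0):
        # feats:       (B, T, dim)  frame-level features
        # cluster_ids: (B, T)       offline K-means labels from the Off-TE
        # mask:        (B, T) bool  True at masked frames
        with torch.no_grad():
            online_targets = self.teacher(feats)  # On-TE targets, no gradients
        m = mask.unsqueeze(-1)
        masked_in = feats.masked_fill(m, 0.0) + m.to(feats.dtype) * self.mask_emb
        hidden = self.student(masked_in)
        # Loss 1: masked prediction of offline cluster IDs (cross-entropy).
        offline_loss = F.cross_entropy(self.cluster_head(hidden)[mask], cluster_ids[mask])
        # Loss 2: regression toward the EMA teacher's representations at masked frames.
        online_loss = F.smooth_l1_loss(self.regress_head(hidden)[mask], online_targets[mask])
        return offline_loss + lambda_online * online_loss
```

In this sketch the two losses are simply summed with a tunable weight; the offline branch supplies stable, discrete targets while the online branch provides continuously refined targets from the teacher, which is the intuition behind combining an Off-TE and an On-TE in one pre-training objective.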
