MT4SSL: Boosting Self-Supervised Speech Representation Learning by Integrating Multiple Targets

In this paper, we provide a new perspective on self-supervised speech models based on how the training targets are obtained. We generalize the targets extractor into an Offline Targets Extractor (Off-TE) and an Online Targets Extractor (On-TE). Based on this, we propose a new multi-task learning framework for self-supervised learning, MT4SSL, which boosts self-supervised speech representation learning by integrating multiple targets. MT4SSL draws on two representative models, HuBERT and data2vec, which use the K-means algorithm as an Off-TE and a gradient-free teacher network as an On-TE, respectively. Our model outperforms previous SSL methods by nontrivial margins on the LibriSpeech benchmark, and is comparable to or even better than the best-performing models while requiring less data. Furthermore, we find that using both Off-TE and On-TE leads to better convergence in the pre-training phase. Given both its effectiveness and efficiency, we believe that multi-task learning on self-supervised speech models from this perspective is a promising direction.
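To make the multi-target idea concrete, the sketch below combines a HuBERT-style masked-prediction loss over offline K-means cluster IDs (Off-TE) with a data2vec-style regression loss against a gradient-free EMA teacher (On-TE). This is a minimal illustration under assumed names, dimensions, and loss weights; it is not the authors' released implementation, and details such as which teacher layers are averaged for the online targets are simplified.

```python
# Minimal sketch: one student encoder trained against two kinds of targets.
# All hyperparameters and module names are illustrative assumptions.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiTargetSSL(nn.Module):
    def __init__(self, dim=768, num_clusters=500, num_layers=12, ema_decay=0.999):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=12, dim_feedforward=3072, batch_first=True
        )
        self.student = nn.TransformerEncoder(layer, num_layers=num_layers)
        # Offline-target head: predicts K-means cluster IDs (Off-TE, HuBERT-style).
        self.cluster_head = nn.Linear(dim, num_clusters)
        # Online-target head: regresses teacher representations (On-TE, data2vec-style).
        self.regress_head = nn.Linear(dim, dim)
        # Gradient-free teacher, updated as an exponential moving average of the student.
        self.teacher = copy.deepcopy(self.student)
        for p in self.teacher.parameters():
            p.requires_grad = False
        self.ema_decay = ema_decay
        self.mask_emb = nn.Parameter(torch.zeros(dim))

    @torch.no_grad()
    def update_teacher(self):
        # Call after each optimizer step.
        for ps, pt in zip(self.student.parameters(), self.teacher.parameters()):
            pt.mul_(self.ema_decay).add_(ps, alpha=1.0 - self.ema_decay)

    def forward(self, feats, cluster_ids, mask, lambda_online=1.0):
        # feats:       (B, T, dim)  frame-level features
        # cluster_ids: (B, T)       offline K-means labels from the Off-TE
        # mask:        (B, T) bool  True at masked frames
        with torch.no_grad():
            online_targets = self.teacher(feats)  # On-TE targets, no gradients
        m = mask.unsqueeze(-1)
        masked_in = feats.masked_fill(m, 0.0) + m.to(feats.dtype) * self.mask_emb
        hidden = self.student(masked_in)
        # Loss 1: masked prediction of offline cluster IDs (cross-entropy).
        offline_loss = F.cross_entropy(self.cluster_head(hidden)[mask], cluster_ids[mask])
        # Loss 2: regression toward the EMA teacher's representations at masked frames.
        online_loss = F.smooth_l1_loss(self.regress_head(hidden)[mask], online_targets[mask])
        return offline_loss + lambda_online * online_loss
```

In this sketch the two losses are simply summed with a tunable weight; the offline branch supplies stable, discrete targets while the online branch provides continuously refined targets from the teacher, which is the intuition behind combining an Off-TE and an On-TE in one pre-training objective.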
