Wav2vec-S: Semi-Supervised Pre-Training for Speech Recognition

Self-supervised pre-training has dramatically improved the performance of automatic speech recognition (ASR). However, most existing self-supervised pre-training approaches are task-agnostic, i.e., they can be applied to various downstream tasks, and there is a gap between such task-agnostic pre-training and task-specific downstream fine-tuning that may degrade downstream performance. In this work, we propose a novel pre-training paradigm called wav2vec-S, which uses task-specific semi-supervised pre-training to bridge this gap. Specifically, the semi-supervised pre-training is conducted on top of self-supervised pre-training such as wav2vec 2.0. Experiments on ASR show that, compared to wav2vec 2.0, wav2vec-S requires only a marginal increase in pre-training time yet significantly improves ASR performance on in-domain, cross-domain, and cross-lingual datasets: the average relative WER reductions are 26.3% and 6.3% for 1 h and 10 h of fine-tuning data, respectively.
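
The reported gains are relative word error rate (WER) reductions over the wav2vec 2.0 baseline, and the recipe itself has three stages. As a minimal, hedged sketch based only on the abstract, the Python snippet below outlines those stages and shows how a relative WER reduction is computed; the stage descriptions are paraphrases, the numbers are illustrative, and nothing here is the authors' implementation.

```python
# Hypothetical three-stage recipe implied by the abstract (a sketch, not the paper's code):
#   1) task-agnostic self-supervised pre-training on unlabeled audio (e.g., wav2vec 2.0);
#   2) task-specific semi-supervised pre-training initialized from stage 1, using labeled
#      speech in addition to unlabeled speech (the exact objectives are not given here);
#   3) supervised fine-tuning on the target ASR data (1 h or 10 h in the reported experiments).

def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Relative WER reduction (%) of a new system over a baseline."""
    return 100.0 * (baseline_wer - new_wer) / baseline_wer

# Illustrative numbers only (not the paper's actual WERs): a baseline WER of 20.0%
# improved to 14.74% corresponds to the reported average of ~26.3% relative reduction.
print(f"{relative_wer_reduction(20.0, 14.74):.1f}%")  # -> 26.3%
```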
