Wav2vec-S: Semi-Supervised Pre-Training for Speech Recognition

Self-supervised pre-training has dramatically improved the performance of automatic speech recognition (ASR). However, most existing self-supervised pre-training approaches are task-agnostic, i.e., they can be applied to various downstream tasks, and there is a gap between such task-agnostic pre-training and task-specific downstream fine-tuning that may degrade downstream performance. In this work, we propose a novel pre-training paradigm called wav2vec-S, which uses task-specific semi-supervised pre-training to bridge this gap. Specifically, the semi-supervised pre-training is conducted on top of self-supervised pre-training such as wav2vec 2.0. Experiments on ASR show that, compared to wav2vec 2.0, wav2vec-S requires only a marginal increase in pre-training time yet significantly improves ASR performance on in-domain, cross-domain, and cross-lingual datasets: the average relative WER reductions are 26.3% and 6.3% for 1 h and 10 h of fine-tuning data, respectively.
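
The reported gains are relative word error rate (WER) reductions over the wav2vec 2.0 baseline, and the recipe itself has three stages. As a minimal, hedged sketch based only on the abstract, the Python snippet below outlines those stages and shows how a relative WER reduction is computed; the stage descriptions are paraphrases, the numbers are illustrative, and nothing here is the authors' implementation.

```python
# Hypothetical three-stage recipe implied by the abstract (a sketch, not the paper's code):
#   1) task-agnostic self-supervised pre-training on unlabeled audio (e.g., wav2vec 2.0);
#   2) task-specific semi-supervised pre-training initialized from stage 1, using labeled
#      speech in addition to unlabeled speech (the exact objectives are not given here);
#   3) supervised fine-tuning on the target ASR data (1 h or 10 h in the reported experiments).

def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Relative WER reduction (%) of a new system over a baseline."""
    return 100.0 * (baseline_wer - new_wer) / baseline_wer

# Illustrative numbers only (not the paper's actual WERs): a baseline WER of 20.0%
# improved to 14.74% corresponds to the reported average of ~26.3% relative reduction.
print(f"{relative_wer_reduction(20.0, 14.74):.1f}%")  # -> 26.3%
```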
