w2v-BERT: Combining Contrastive Learning and Masked Language Modeling for Self-Supervised Speech Pre-Training

Motivated by the success of masked language modeling (MLM) in pre-training natural language processing models, we propose w2v-BERT, which explores MLM for self-supervised speech representation learning. w2v-BERT is a framework that combines contrastive learning and MLM: the former trains the model to discretize continuous input speech signals into a finite set of discriminative speech tokens, and the latter trains the model to learn contextualized speech representations by solving a masked prediction task that consumes the discretized tokens. In contrast to existing MLM-based speech pre-training frameworks such as HuBERT, which relies on an iterative re-clustering and re-training process, or vq-wav2vec, which concatenates two separately trained modules, w2v-BERT can be optimized end-to-end by solving the two self-supervised tasks (the contrastive task and MLM) simultaneously. Our experiments show that w2v-BERT achieves competitive results compared to current state-of-the-art pre-trained models on the LibriSpeech benchmarks when using the Libri-Light 60k corpus as the unsupervised data. In particular, compared to published models such as conformer-based wav2vec 2.0 and HuBERT, our model shows a 5% to 10% relative WER reduction on the test-clean and test-other subsets. When applied to Google's Voice Search traffic dataset, w2v-BERT outperforms our internal conformer-based wav2vec 2.0 by more than 30% relative.
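To make the two-task setup concrete, below is a minimal, hypothetical PyTorch sketch of the joint objective, not the authors' implementation. All names and sizes (ToyW2vBERT, dim, num_codes, mask_prob) are assumptions for illustration; vanilla transformer layers stand in for the paper's conformer blocks, the hard argmax quantizer is a non-differentiable simplification of the Gumbel-softmax quantizer used in this line of work, and the contrastive loss here treats all codebook entries as distractors rather than sampling negatives.

```python
# Toy sketch of w2v-BERT's joint objective, NOT the authors' implementation.
# Assumptions: toy dimensions, plain transformer layers instead of conformer
# blocks, hard argmax instead of a Gumbel-softmax quantizer, and all codebook
# entries used as contrastive distractors instead of sampled negatives.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ToyW2vBERT(nn.Module):
    def __init__(self, dim=256, num_codes=320, mask_prob=0.065, temperature=0.1):
        super().__init__()
        self.mask_prob = mask_prob
        self.temperature = temperature
        self.mask_emb = nn.Parameter(torch.randn(dim))  # learned mask vector
        layer = lambda: nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.contrastive_net = nn.TransformerEncoder(layer(), num_layers=2)
        self.mlm_net = nn.TransformerEncoder(layer(), num_layers=2)
        self.codebook = nn.Parameter(torch.randn(num_codes, dim))  # quantizer codes
        self.code_proj = nn.Linear(dim, num_codes)  # assigns a code id per frame
        self.mlm_head = nn.Linear(dim, num_codes)   # masked token prediction head

    def forward(self, feats):  # feats: (batch, time, dim) acoustic features
        B, T, _ = feats.shape
        mask = torch.rand(B, T, device=feats.device) < self.mask_prob
        mask[:, 0] = True                     # keep the toy loss well-defined
        masked = feats.clone()
        masked[mask] = self.mask_emb          # replace masked frames

        # Discrete targets come from the *unmasked* input; the contrastive
        # module sees the masked version.
        code_ids = self.code_proj(feats).argmax(dim=-1)  # (B, T) token ids
        context = self.contrastive_net(masked)           # (B, T, dim)

        # Contrastive task: each masked context vector must identify its own
        # code among all codebook entries (cosine similarity / temperature).
        sims = F.normalize(context[mask], dim=-1) @ F.normalize(self.codebook, dim=-1).T
        loss_contrastive = F.cross_entropy(sims / self.temperature, code_ids[mask])

        # MLM task: a second stack consumes the contrastive module's output and
        # predicts the discrete token id at each masked position.
        mlm_logits = self.mlm_head(self.mlm_net(context))  # (B, T, num_codes)
        loss_mlm = F.cross_entropy(mlm_logits[mask], code_ids[mask])

        return loss_contrastive + loss_mlm  # both tasks optimized jointly


if __name__ == "__main__":
    model = ToyW2vBERT()
    loss = model(torch.randn(2, 100, 256))  # two utterances of 100 frames
    loss.backward()                         # gradients flow through both tasks
    print(f"joint loss: {loss.item():.3f}")
```

The key design point the sketch illustrates is that the MLM module's prediction targets are produced online by the contrastive module's own quantizer, so the whole model trains end-to-end in a single pass, with no HuBERT-style offline re-clustering and no separately pre-trained quantizer as in vq-wav2vec.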

[1] Mike Schuster et al., "Japanese and Korean voice search," ICASSP, 2012.

[2] Alexei Baevski et al., "wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations," NeurIPS, 2020.

[3] Shang-Wen Li et al., "TERA: Self-Supervised Learning of Transformer Encoder Representation for Speech," IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2020.

[4] Sree Hari Krishnan Parthasarathi et al., "Lessons from Building Acoustic Models with a Million Hours of Speech," ICASSP, 2019.

[5] Caiming Xiong et al., "Representation Learning for Sequence Data with Deep Autoencoding Predictive Components," arXiv, 2020.

[6] Quoc V. Le et al., "Improved Noisy Student Training for Automatic Speech Recognition," INTERSPEECH, 2020.

[7] David Yarowsky et al., "Unsupervised Word Sense Disambiguation Rivaling Supervised Methods," ACL, 1995.

[8] Hung-yi Lee et al., "Mockingjay: Unsupervised Speech Representation Learning with Deep Bidirectional Transformer Encoders," ICASSP, 2020.

[9] Quoc V. Le et al., "SpecAugment on Large Scale Datasets," ICASSP, 2020.

[10] Armand Joulin et al., "Libri-Light: A Benchmark for ASR with Limited or No Supervision," ICASSP, 2020.

[11] Sergey Ioffe et al., "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift," ICML, 2015.

[12] Tara N. Sainath et al., "Scaling End-to-End Models for Large-Scale Multilingual ASR," ASRU, 2021.

[13] Alexei Baevski et al., "vq-wav2vec: Self-Supervised Learning of Discrete Speech Representations," ICLR, 2019.

[14] Jimmy Ba et al., "Adam: A Method for Stochastic Optimization," ICLR, 2014.

[15] Awni Hannun et al., "Self-Training for End-to-End Speech Recognition," ICASSP, 2020.

[16] Gabriel Synnaeve et al., "Self-Training and Pre-Training are Complementary for Speech Recognition," ICASSP, 2021.

[17] Quoc V. Le et al., "SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition," INTERSPEECH, 2019.

[18] Ellen Riloff et al., "Learning Extraction Patterns for Subjective Expressions," EMNLP, 2003.

[19] Yuzong Liu et al., "Deep Contextualized Acoustic Representations for Semi-Supervised Speech Recognition," ICASSP, 2020.

[20] James R. Glass et al., "Improved Speech Representations with Multi-Target Autoregressive Predictive Coding," ACL, 2020.

[21] Yu Zhang et al., "Conformer: Convolution-augmented Transformer for Speech Recognition," INTERSPEECH, 2020.

[22] Yu-An Chung et al., "Generative Pre-Training for Speech with Autoregressive Predictive Coding," ICASSP, 2020.

[23] Hao Tang et al., "An Unsupervised Autoregressive Model for Speech Representation Learning," INTERSPEECH, 2019.

[24] Quoc V. Le et al., "Searching for Activation Functions," arXiv, 2018.

[25] Alexei Baevski et al., "Effectiveness of self-supervised pre-training for speech recognition," arXiv, 2019.

[26] George Zavaliagkos et al., "Utilizing untranscribed training data to improve performance," LREC, 1998.

[27] Lukasz Kaiser et al., "Attention is All you Need," NIPS, 2017.

[28] Karen Livescu et al., "Unsupervised Pre-Training of Bidirectional Speech Encoders via Masked Reconstruction," ICASSP, 2020.

[29] Alex Graves et al., "Sequence Transduction with Recurrent Neural Networks," arXiv, 2012.

[30] Ruslan Salakhutdinov et al., "HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units," IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2021.

[31] Jürgen Schmidhuber et al., "Long Short-Term Memory," Neural Computation, 1997.

[32] Ming-Wei Chang et al., "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding," NAACL, 2019.

[33] Tara N. Sainath et al., "Semi-supervised Training for End-to-end Models via Weak Distillation," ICASSP, 2019.

[34] Quoc V. Le et al., "Pushing the Limits of Semi-Supervised Learning for Automatic Speech Recognition," arXiv, 2020.

[35] Oriol Vinyals et al., "Representation Learning with Contrastive Predictive Coding," arXiv, 2018.

[36] Yuzong Liu et al., "DeCoAR 2.0: Deep Contextualized Acoustic Representations with Vector Quantization," arXiv, 2020.

[37] Edouard Grave et al., "End-to-end ASR: from Supervised to Semi-Supervised Learning with Modern Architectures," arXiv, 2019.

[38] Richard M. Schwartz et al., "Analysis of low-resource acoustic model self-training," INTERSPEECH, 2009.

[39] Sanjeev Khudanpur et al., "Librispeech: An ASR corpus based on public domain audio books," ICASSP, 2015.

[40] H. J. Scudder et al., "Probability of error of some adaptive pattern-recognition machines," IEEE Transactions on Information Theory, 1965.

[41] Noam Shazeer et al., "Adafactor: Adaptive Learning Rates with Sublinear Memory Cost," ICML, 2018.

[42] Ronan Collobert et al., "wav2vec: Unsupervised Pre-training for Speech Recognition," INTERSPEECH, 2019.