SCaLa: Supervised Contrastive Learning for End-to-End Automatic Speech Recognition

End-to-end Automatic Speech Recognition (ASR) models are usually trained to minimize a loss over whole token sequences, neglecting explicit phoneme-level supervision. This can lead to recognition errors caused by confusion between similar phonemes or by phoneme reduction. To alleviate this problem, this paper proposes Supervised Contrastive Learning (SCaLa), a novel framework that enhances phonemic representation learning for end-to-end ASR systems. Specifically, we introduce self-supervised Masked Contrastive Predictive Coding (MCPC) into the fully supervised setting. To supervise phoneme learning explicitly, SCaLa first masks the variable-length encoder features corresponding to phonemes, using phoneme forced alignments extracted from a pre-trained acoustic model, and then predicts the masked phonemes via contrastive learning. The forced alignments mitigate the noise in positive-negative pair selection that arises in self-supervised MCPC. Experiments on read and spontaneous speech datasets show that the proposed approach achieves 2.84% and 1.38% Character Error Rate (CER) reductions over the baseline, respectively.
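
As a rough illustration of the phoneme-level contrastive objective sketched in the abstract, the snippet below implements an InfoNCE-style loss in PyTorch in which each masked encoder frame must identify the embedding of its ground-truth phoneme (given by the forced alignment) against the embeddings of all other phoneme classes. The tensor shapes, the learnable phoneme codebook, and the function name are illustrative assumptions, not the paper's exact formulation.

```python
# Minimal sketch of a masked, phoneme-level contrastive (InfoNCE-style) loss.
# All names and shapes are assumptions for illustration; the paper's exact
# formulation, negative-sampling scheme, and masking strategy may differ.
import torch
import torch.nn.functional as F


def masked_phoneme_contrastive_loss(
    encoder_out: torch.Tensor,       # (T, D) encoder features for one utterance
    phoneme_ids: torch.Tensor,       # (T,) frame-level phoneme labels from forced alignment
    mask: torch.Tensor,              # (T,) bool, True where frames were masked before encoding
    phoneme_codebook: torch.Tensor,  # (V, D) learnable embedding per phoneme class
    temperature: float = 0.1,
) -> torch.Tensor:
    """Each masked frame (anchor) must pick out its own phoneme embedding
    (positive) against the embeddings of all other phoneme classes (negatives)."""
    anchors = encoder_out[mask]      # (M, D) features at masked positions
    targets = phoneme_ids[mask]      # (M,) ground-truth phoneme index per anchor

    # Cosine similarity between every masked frame and every phoneme embedding.
    anchors = F.normalize(anchors, dim=-1)
    codebook = F.normalize(phoneme_codebook, dim=-1)
    logits = anchors @ codebook.t() / temperature   # (M, V)

    # Cross-entropy over phoneme classes is equivalent to InfoNCE with the
    # true phoneme as the positive and all other classes as negatives.
    return F.cross_entropy(logits, targets)
```

In a framework like SCaLa, such a term would presumably be added to the standard sequence-level training loss (e.g. CTC or attention) with a weighting factor; the precise combination and masking details are those of the paper, not this sketch.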
