Speech Recognition by Simply Fine-Tuning BERT

We propose a simple method for automatic speech recognition (ASR) based on fine-tuning BERT, a language model (LM) trained on large-scale unlabeled text that produces rich contextual representations. Our assumption is that, given a history context sequence, a powerful LM can narrow the range of possible choices, so the speech signal is needed only as a simple clue. Hence, in contrast to conventional ASR systems that train a powerful acoustic model (AM) from scratch, we believe speech recognition is possible by simply fine-tuning a BERT model. As an initial study, we demonstrate the effectiveness of this idea on the AISHELL dataset and show that stacking a very simple AM on top of BERT yields reasonable performance.
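To make the idea concrete, below is a minimal sketch (not the authors' exact architecture) of stacking a simple acoustic model on top of a pre-trained BERT: acoustic frames are projected into BERT's hidden space, fused into the contextual representation at a masked position, and scored with BERT's masked-LM head. The mean-pooled acoustic summary, the fusion by addition, and the choice of a Chinese BERT checkpoint are illustrative assumptions.

```python
# A hedged sketch of "simple AM on top of BERT" for ASR, assuming the task is
# framed as predicting a masked token from text history plus an acoustic clue.
import torch
import torch.nn as nn
from transformers import BertForMaskedLM


class BertASRSketch(nn.Module):
    def __init__(self, bert_name="bert-base-chinese", feat_dim=80):
        super().__init__()
        self.bert_mlm = BertForMaskedLM.from_pretrained(bert_name)
        hidden = self.bert_mlm.config.hidden_size
        # Very simple acoustic model: project filterbank frames into BERT's space.
        self.acoustic_proj = nn.Sequential(
            nn.Linear(feat_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
        )

    def forward(self, input_ids, attention_mask, fbank, mask_positions):
        # Contextual representations of the (partially masked) text history.
        outputs = self.bert_mlm.bert(
            input_ids=input_ids, attention_mask=attention_mask
        )
        hidden_states = outputs.last_hidden_state          # (B, T, H)
        # Summarize the speech segment by mean-pooling its projected frames
        # (an assumed, deliberately simple acoustic summary).
        acoustic = self.acoustic_proj(fbank).mean(dim=1)    # (B, H)
        # Fuse the acoustic clue into the masked position, then score the vocabulary
        # with BERT's masked-LM head.
        batch_idx = torch.arange(input_ids.size(0), device=input_ids.device)
        fused = hidden_states.clone()
        fused[batch_idx, mask_positions] = (
            fused[batch_idx, mask_positions] + acoustic
        )
        logits = self.bert_mlm.cls(fused)                   # (B, T, vocab)
        return logits[batch_idx, mask_positions]            # (B, vocab)
```

Fine-tuning then amounts to cross-entropy between these logits and the reference characters; the key point of the sketch is that the only newly trained acoustic component is a small projection, while BERT supplies the contextual modeling.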
