Weakly Supervised Construction of ASR Systems from Massive Video Data

Building Automatic Speech Recognition (ASR) systems from scratch is challenging, largely because annotating large amounts of audio data with transcripts is time-consuming and expensive. Although several unsupervised pre-training models have been proposed, applying them directly can still be sub-optimal when additional labeled training data can be obtained at low cost. In this paper, we present a weakly supervised framework for constructing ASR systems from massive video data. Since videos often contain human speech aligned with subtitles, we treat them as an important knowledge source and propose an effective approach, based on Optical Character Recognition (OCR), to extract high-quality audio aligned with transcripts from videos. After weakly supervised pre-training, the underlying ASR model can be fine-tuned to fit any domain-specific target training dataset. Extensive experiments show that our framework produces state-of-the-art results on six public Mandarin speech recognition datasets.
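
The abstract describes the extraction step only at a high level. The Python sketch below illustrates one plausible realization under stated assumptions: subtitles are hard-coded in a fixed region at the bottom of each frame, `ocr_subtitle` is a hypothetical stand-in for whatever OCR engine is used (the paper's actual text detection and recognition models are not specified here), and consecutive frames with identical OCR text are grouped into time spans whose audio becomes a weak (audio, transcript) training pair.

```python
# Hedged sketch of OCR-based extraction of (audio, transcript) pairs from video.
# Assumptions: hard-coded subtitles in the bottom 20% of the frame; an external
# OCR engine plugged into `ocr_subtitle`; ffmpeg available on PATH.
import subprocess
import cv2  # OpenCV, used only for frame decoding


def ocr_subtitle(frame) -> str:
    """Hypothetical OCR call: return the subtitle text in `frame`, or ''."""
    raise NotImplementedError("plug in an OCR engine here")


def extract_spans(video_path: str, frame_stride: int = 5):
    """Group consecutive sampled frames whose OCR text matches into
    (start_sec, end_sec, text) spans, i.e. weak transcript alignments."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    spans, cur_text, start = [], None, 0.0
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % frame_stride == 0:
            t = idx / fps
            h = frame.shape[0]
            text = ocr_subtitle(frame[int(0.8 * h):, :])  # subtitle region
            if text != cur_text:
                if cur_text:  # close the previous non-empty span
                    spans.append((start, t, cur_text))
                cur_text, start = text, t
        idx += 1
    if cur_text:  # close the final span at end of video
        spans.append((start, idx / fps, cur_text))
    cap.release()
    return spans


def cut_audio(video_path: str, start: float, end: float, out_wav: str):
    """Slice the aligned audio segment with ffmpeg (16 kHz mono for ASR)."""
    subprocess.run(
        ["ffmpeg", "-y", "-i", video_path, "-ss", str(start), "-to", str(end),
         "-ar", "16000", "-ac", "1", out_wav],
        check=True,
    )
```

A real pipeline would additionally filter low-confidence OCR outputs and poorly aligned segments before pre-training, since the paper emphasizes extracting high-quality pairs rather than using all subtitle spans verbatim.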
