Weakly Supervised Construction of ASR Systems from Massive Video Data

Building Automatic Speech Recognition (ASR) systems from scratch is challenging, largely because annotating large amounts of audio data with transcripts is time-consuming and expensive. Although several unsupervised pre-training models have been proposed, applying them directly can still be sub-optimal when additional labeled training data can be obtained at low cost. In this paper, we present a weakly supervised framework for constructing ASR systems from massive video data. Since videos often contain human speech aligned with subtitles, we treat them as an important knowledge source and propose an effective approach, based on Optical Character Recognition (OCR), to extract high-quality audio aligned with transcripts from videos. After weakly supervised pre-training, the underlying ASR model can be fine-tuned to fit any domain-specific target training dataset. Extensive experiments show that our framework produces state-of-the-art results on six public Mandarin speech recognition datasets.
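
The abstract describes the extraction step only at a high level. The Python sketch below illustrates one plausible realization under stated assumptions: subtitles are hard-coded in a fixed region at the bottom of each frame, `ocr_subtitle` is a hypothetical stand-in for whatever OCR engine is used (the paper's actual text detection and recognition models are not specified here), and consecutive frames with identical OCR text are grouped into time spans whose audio becomes a weak (audio, transcript) training pair.

```python
# Hedged sketch of OCR-based extraction of (audio, transcript) pairs from video.
# Assumptions: hard-coded subtitles in the bottom 20% of the frame; an external
# OCR engine plugged into `ocr_subtitle`; ffmpeg available on PATH.
import subprocess
import cv2  # OpenCV, used only for frame decoding


def ocr_subtitle(frame) -> str:
    """Hypothetical OCR call: return the subtitle text in `frame`, or ''."""
    raise NotImplementedError("plug in an OCR engine here")


def extract_spans(video_path: str, frame_stride: int = 5):
    """Group consecutive sampled frames whose OCR text matches into
    (start_sec, end_sec, text) spans, i.e. weak transcript alignments."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    spans, cur_text, start = [], None, 0.0
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % frame_stride == 0:
            t = idx / fps
            h = frame.shape[0]
            text = ocr_subtitle(frame[int(0.8 * h):, :])  # subtitle region
            if text != cur_text:
                if cur_text:  # close the previous non-empty span
                    spans.append((start, t, cur_text))
                cur_text, start = text, t
        idx += 1
    if cur_text:  # close the final span at end of video
        spans.append((start, idx / fps, cur_text))
    cap.release()
    return spans


def cut_audio(video_path: str, start: float, end: float, out_wav: str):
    """Slice the aligned audio segment with ffmpeg (16 kHz mono for ASR)."""
    subprocess.run(
        ["ffmpeg", "-y", "-i", video_path, "-ss", str(start), "-to", str(end),
         "-ar", "16000", "-ac", "1", out_wav],
        check=True,
    )
```

A real pipeline would additionally filter low-confidence OCR outputs and poorly aligned segments before pre-training, since the paper emphasizes extracting high-quality pairs rather than using all subtitle spans verbatim.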
