Phoneme-to-Grapheme Conversion Based Large-Scale Pre-Training for End-to-End Automatic Speech Recognition

This paper describes a simple and efficient pre-training method that uses a large amount of external text to enhance end-to-end automatic speech recognition (ASR). Constructing end-to-end ASR models generally requires speech-to-text paired data, but collecting such data at scale is difficult in practice. One consequence of this data scarcity is poor ASR performance on out-of-domain tasks, i.e., tasks whose domain differs from that of the available speech-to-text paired data, because the mapping from speech information to textual information is not well learned. To address this problem, we leverage a large amount of phoneme-to-grapheme (P2G) paired data, which can easily be created from external texts and a rich pronunciation dictionary. P2G conversion and end-to-end ASR can be regarded as similar transformation tasks in which input phonetic information is converted into textual information. Our method uses the P2G conversion task to pre-train the decoder network of a Transformer encoder-decoder based end-to-end ASR model. Experiments using 4 billion tokens of Web text demonstrate that our pre-training significantly improves ASR performance on out-of-domain tasks.
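
As a concrete illustration, below is a minimal sketch of the two-stage recipe the abstract describes: (1) derive P2G pairs from external text using a pronunciation dictionary, and (2) pre-train a Transformer decoder on P2G conversion before transferring it to the ASR model. Everything here (the toy `lexicon`, the `Seq2Seq` module, the parameter-transfer step) is a hypothetical PyTorch illustration under these assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

# Step 1 (hypothetical): build phoneme-to-grapheme pairs from raw text via a
# pronunciation dictionary. A real system would use a full lexicon and handle
# out-of-vocabulary words; this toy dictionary is only for illustration.
lexicon = {"hello": ["HH", "AH", "L", "OW"], "world": ["W", "ER", "L", "D"]}

def text_to_p2g_pair(sentence):
    """Source = phoneme sequence of the words; target = grapheme sequence."""
    phonemes = [p for w in sentence.split() for p in lexicon[w]]
    return phonemes, list(sentence)

# Step 2 (hypothetical): a standard Transformer encoder-decoder. During P2G
# pre-training the encoder embeds phoneme IDs; for ASR fine-tuning the encoder
# would be replaced by an acoustic front-end, while the pre-trained decoder
# and output projection are carried over.
class Seq2Seq(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, d_model=256):
        super().__init__()
        self.src_embed = nn.Embedding(src_vocab, d_model)
        self.tgt_embed = nn.Embedding(tgt_vocab, d_model)
        self.transformer = nn.Transformer(d_model=d_model, batch_first=True)
        self.out_proj = nn.Linear(d_model, tgt_vocab)

    def forward(self, src_ids, tgt_ids):
        memory = self.transformer.encoder(self.src_embed(src_ids))
        causal = self.transformer.generate_square_subsequent_mask(
            tgt_ids.size(1)).to(tgt_ids.device)
        dec = self.transformer.decoder(
            self.tgt_embed(tgt_ids), memory, tgt_mask=causal)
        return self.out_proj(dec)  # per-step grapheme logits

# After P2G pre-training, transfer only the decoder-side parameters into the
# ASR model; its encoder is trained from scratch on acoustic features.
p2g_model = Seq2Seq(src_vocab=100, tgt_vocab=5000)
asr_model = Seq2Seq(src_vocab=100, tgt_vocab=5000)
asr_model.transformer.decoder.load_state_dict(
    p2g_model.transformer.decoder.state_dict())
asr_model.out_proj.load_state_dict(p2g_model.out_proj.state_dict())
```

The key point the sketch tries to capture is that only the decoder side touches text, so the pre-training stage needs no speech at all; this is what makes billions of tokens of unpaired Web text usable for improving the speech-to-text mapping.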
