Paper information - Unsupervised pre-training for sequence to sequence speech recognition

Unsupervised pre-training for sequence to sequence speech recognition

This paper proposes a novel approach to pre-training an encoder-decoder sequence-to-sequence (seq2seq) model with unpaired speech and unpaired transcripts. Our pre-training method is divided into two stages: acoustic pre-training and linguistic pre-training. In the acoustic pre-training stage, we use a large amount of speech to pre-train the encoder by predicting masked chunks of speech features from their context. In the linguistic pre-training stage, we generate synthesized speech from a large number of transcripts using a single-speaker text-to-speech (TTS) system, and use the synthesized paired data to pre-train the decoder. This two-stage pre-training method integrates rich acoustic and linguistic knowledge into the seq2seq model, which benefits downstream automatic speech recognition (ASR) tasks. The unsupervised pre-training is carried out on the AISHELL-2 dataset, and we apply the pre-trained model to AISHELL-1 and HKUST with multiple ratios of paired data. We obtain relative character error rate reductions (CERR) ranging from 38.24% to 7.88% on AISHELL-1 and from 12.00% to 1.20% on HKUST. In addition, we apply our pre-trained model to a cross-lingual case with the CALLHOME dataset. For all six languages in CALLHOME, our pre-training method makes the model consistently outperform the baseline.
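The acoustic pre-training objective described in the abstract (predicting masked speech feature chunks from context) can be illustrated with a minimal sketch. The abstract does not give the chunk length, masking probability, or loss function, so `chunk_len`, `mask_prob`, the zero-fill masking, and the L1 loss below are all assumptions chosen for illustration, and the function names are hypothetical rather than taken from the authors' code.

```python
import random


def mask_chunks(features, chunk_len=4, mask_prob=0.15, seed=0):
    """Zero out randomly chosen contiguous chunks of frames.

    features: list of frames, each frame a list of floats.
    chunk_len, mask_prob: illustrative values, not from the paper.
    Returns (masked_features, mask); mask[t] is True for masked frames.
    """
    rng = random.Random(seed)
    num_frames = len(features)
    mask = [False] * num_frames
    # Visit each aligned chunk and mask it with probability mask_prob.
    for start in range(0, num_frames, chunk_len):
        if rng.random() < mask_prob:
            for t in range(start, min(start + chunk_len, num_frames)):
                mask[t] = True
    # Masked frames are replaced by zeros; other frames are copied.
    masked = [
        [0.0] * len(frame) if mask[t] else list(frame)
        for t, frame in enumerate(features)
    ]
    return masked, mask


def masked_l1_loss(predicted, target, mask):
    """Mean absolute error over masked frames only (assumed loss)."""
    total, count = 0.0, 0
    for pred_frame, tgt_frame, is_masked in zip(predicted, target, mask):
        if is_masked:
            total += sum(abs(p - q) for p, q in zip(pred_frame, tgt_frame))
            count += len(tgt_frame)
    return total / count if count else 0.0


if __name__ == "__main__":
    # Toy "speech features": 20 frames with 3 dimensions each.
    features = [[float(t)] * 3 for t in range(20)]
    masked, mask = mask_chunks(features, chunk_len=4, mask_prob=0.5, seed=1)
    # An encoder would take `masked` as input and be trained so that its
    # output matches `features` at the masked positions.
    print(sum(mask), "of", len(mask), "frames masked")
```

In an actual system the encoder's prediction would take the place of `predicted`, and the loss would be minimized over a large unpaired speech corpus; restricting the loss to masked frames forces the encoder to rely on the surrounding context.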

Shiyu Zhou | Bo Xu | Zhiyun Fan
