Integrating Knowledge Into End-to-End Speech Recognition From External Text-Only Data

Attention-based encoder-decoder (AED) models have achieved promising performance in speech recognition. However, because an AED model is trained end to end, it is usually trained only on paired speech-text data, and it is challenging to incorporate external text-only data. Another limitation is that an AED model does not use the right context of a text token when predicting that token. To alleviate these two issues, we propose a unified method called LST (Learn Spelling from Teachers), which integrates knowledge from external text-only data into an AED model and leverages the whole context of a sentence. The method has two stages. First, in the representation stage, a language model (LM) is trained on the text; this can be viewed as compressing the knowledge in the text into the LM. Then, in the transferring stage, that knowledge is transferred to the AED model via teacher-student learning. To further use the whole context of the text sentence, we propose an LM called the causal cloze completer (COR), which estimates the probability of a token given both its left and right context. With LST training, the AED model can therefore leverage the whole context of the sentence. Unlike fusion-based methods, which use an LM during decoding, the proposed method adds no extra complexity at the inference stage. We conduct experiments on two public Chinese datasets of different scales, AISHELL-1 and AISHELL-2. The experimental results demonstrate the effectiveness of leveraging external text-only data and the whole sentence context with the proposed method, compared with baseline hybrid systems and AED-based systems.
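The transferring stage can be illustrated with a minimal sketch of a teacher-student loss. The abstract does not give the exact objective, so the form below is an assumption: a weighted sum of the usual cross-entropy against the reference token and a cross-entropy against the teacher LM's soft distribution. The function names, the interpolation weight `lam`, and the toy numbers are all hypothetical. The teacher distribution could come from a left-to-right LM or from a model such as COR that conditions on both sides; in either case it is only needed during training, which is consistent with the claim that inference cost does not change.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def lst_token_loss(student_logits, teacher_probs, target_index, lam=0.5):
    """Assumed per-token loss: (1 - lam) * hard-label CE + lam * soft-label CE.

    student_logits: AED decoder logits over the vocabulary for one position.
    teacher_probs:  teacher LM distribution over the same vocabulary.
    target_index:   index of the reference token.
    lam:            hypothetical interpolation weight in [0, 1].
    """
    log_q = log_softmax(student_logits)
    hard = -log_q[target_index]
    soft = -sum(p * lq for p, lq in zip(teacher_probs, log_q))
    return (1.0 - lam) * hard + lam * soft


def lst_sequence_loss(student_logits_seq, teacher_probs_seq, targets, lam=0.5):
    """Average the per-token loss over all positions of one sentence."""
    losses = [
        lst_token_loss(s, t, y, lam)
        for s, t, y in zip(student_logits_seq, teacher_probs_seq, targets)
    ]
    return sum(losses) / len(losses)


if __name__ == "__main__":
    # Toy example: a 3-token vocabulary and a 2-token sentence.
    student = [[2.0, 0.5, -1.0], [0.1, 1.5, 0.3]]
    teacher = [[0.8, 0.15, 0.05], [0.1, 0.7, 0.2]]
    print(lst_sequence_loss(student, teacher, targets=[0, 1], lam=0.5))
```

Setting `lam=0` recovers ordinary supervised training on the paired data, while larger values push the decoder toward the teacher's distribution.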
