Improving Real-time Recognition of Morphologically Rich Speech with Transformer Language Model

Transformer models have become the state of the art in natural language understanding, and their use for language modeling in Automatic Speech Recognition (ASR) is also promising. Although Transformer-based language models have been shown to improve ASR performance, their computational complexity makes them challenging to apply in real-time systems. It has also been shown that the knowledge of such language models can be transferred to traditional n-gram models, which are suitable for real-time decoding. This paper investigates the adaptation of this transfer approach to morphologically rich languages in a real-time scenario. We propose a new method for subword-based neural text augmentation with a Transformer language model, which consists of retokenizing the training corpus into subwords using a statistical, data-driven approach. We demonstrate that ASR performance can be improved while the vocabulary size is reduced and memory consumption is alleviated.
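Since the abstract only sketches the pipeline, the following minimal Python sketch illustrates one plausible realization of the described method: synthetic text is sampled from a pretrained Transformer language model (here GPT-2 via HuggingFace Transformers), the corpus is retokenized into subwords with a statistical, data-driven segmenter (here SentencePiece's unigram model, standing in for whatever segmenter the authors use), and the resulting subword corpus would then feed a conventional back-off n-gram trainer such as SRILM. All model names, file names, and hyperparameters are illustrative assumptions, not the authors' exact configuration.

```python
# Hypothetical sketch of subword-based neural text augmentation for
# n-gram LM training; models and settings are illustrative only.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer
import sentencepiece as spm

tok = GPT2Tokenizer.from_pretrained("gpt2")
lm = GPT2LMHeadModel.from_pretrained("gpt2")
lm.eval()

# (1) Sample synthetic training sentences from the Transformer LM
# (unconditional generation from the beginning-of-sequence token).
start = torch.tensor([[tok.bos_token_id]])
with torch.no_grad():
    samples = lm.generate(
        start,
        do_sample=True,            # stochastic sampling, not greedy decoding
        top_k=50,
        max_length=40,
        num_return_sequences=8,    # tiny for the demo; millions in practice
        pad_token_id=tok.eos_token_id,
    )
generated = [
    tok.decode(s, skip_special_tokens=True).replace("\n", " ")
    for s in samples
]
with open("augmented.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(generated))

# (2) Train a data-driven subword segmenter (in practice on the original
# plus augmented corpus) and retokenize every sentence into subwords.
spm.SentencePieceTrainer.train(
    input="augmented.txt",
    model_prefix="subword",
    vocab_size=200,                # illustrative; real setups use thousands
    model_type="unigram",
    hard_vocab_limit=False,        # tolerate the tiny demo corpus
)
sp = spm.SentencePieceProcessor(model_file="subword.model")
subword_corpus = [" ".join(sp.encode(s, out_type=str)) for s in generated]

# (3) The subword corpus is then used to estimate a back-off n-gram LM
# for real-time decoding, e.g. with SRILM's ngram-count (not run here):
#   ngram-count -text subword_corpus.txt -order 4 -kndiscount -lm subword.arpa
```

Because the n-gram model is estimated over subwords rather than full word forms, the vocabulary stays small even for a morphologically rich language, which is consistent with the memory reduction the abstract claims.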
