Investigation on N-gram Approximated RNNLMs for Recognition of Morphologically Rich Speech

Recognition of Hungarian conversational telephone speech is challenging due to the informal style and morphological richness of the language. Recurrent neural network language models (RNNLMs) can remedy the high perplexity of the task; however, two-pass decoding introduces a considerable processing delay. To eliminate this delay, we investigate approaches that reduce the complexity of the RNNLM while preserving its accuracy. We compare the performance of conventional back-off n-gram language models (BNLMs), BNLM approximations of RNNLMs (RNN-BNLMs), and RNN n-grams in terms of perplexity and word error rate (WER). Morphological richness is often addressed by using statistically derived subwords (morphs) in the language models; hence, our investigations are extended to morph-based models as well. We found that RNN-BNLMs recover 40% of the RNNLM perplexity reduction, which is roughly equal to the performance of an RNN 4-gram model. Combining morph-based modeling with RNNLM approximation, we achieved an 8% relative WER reduction while preserving real-time operation of our conversational telephone speech recognition system.
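
The two headline figures in the abstract, the share of the RNNLM perplexity reduction that an approximation recovers and the relative WER reduction, can be read as simple ratios. The sketch below shows one plausible way to compute them. The abstract does not define either formula, so both function names and both definitions are assumptions, and the input numbers are made up for illustration only; they are not results from the paper.

```python
def recovered_fraction(ppl_baseline: float, ppl_approx: float, ppl_full: float) -> float:
    """Share of the baseline-to-RNNLM perplexity gain kept by an approximation.

    Assumed definition (not stated in the abstract):
        (PPL_baseline - PPL_approx) / (PPL_baseline - PPL_full)
    """
    full_gain = ppl_baseline - ppl_full
    if full_gain <= 0:
        raise ValueError("the full model must have lower perplexity than the baseline")
    return (ppl_baseline - ppl_approx) / full_gain


def relative_reduction(baseline: float, improved: float) -> float:
    """Relative reduction of an error metric such as WER: (baseline - improved) / baseline."""
    if baseline <= 0:
        raise ValueError("the baseline value must be positive")
    return (baseline - improved) / baseline


if __name__ == "__main__":
    # Hypothetical perplexities for a back-off n-gram baseline, an n-gram
    # approximation of the RNNLM, and the full RNNLM (illustrative values only).
    share = recovered_fraction(ppl_baseline=200.0, ppl_approx=180.0, ppl_full=150.0)
    print(f"recovered share of perplexity reduction: {share:.0%}")

    # Hypothetical WER values in percent (illustrative values only).
    rel = relative_reduction(baseline=25.0, improved=23.0)
    print(f"relative WER reduction: {rel:.0%}")
```

Under these assumed definitions, an approximation that closes 20 of the 50 perplexity points separating the baseline from the full RNNLM recovers 40% of the reduction, and a drop from 25% to 23% WER is an 8% relative reduction.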