Advances in subword-based HMM-DNN speech recognition across languages

Abstract We describe a novel way to implement subword language models in speech recognition systems based on weighted finite state transducers, hidden Markov models, and deep neural networks. The acoustic models are built on graphemes, so no pronunciation dictionaries are needed, and they can be used together with any type of subword language model, including character models. The advantages of short subword units are good lexical coverage, reduced data sparsity, and the avoidance of vocabulary mismatches in adaptation. Moreover, constructing neural network language models (NNLMs) is more practical, because the input and output layers are small. We also propose methods for combining the benefits of different types of language model units by reconstructing and combining the recognition lattices. We present an extensive evaluation of various subword units on speech datasets of four languages: Finnish, Swedish, Arabic, and English. The results show that the benefits of short subwords are even more consistent with NNLMs than with traditional n-gram language models. Combining different acoustic models and language models with various units improves the results further. For all four datasets we obtain the best results published so far. Our approach performs well even for English, where phoneme-based acoustic models and word-based language models typically dominate: the phoneme-based baseline performance can be reached and improved by 4% using graphemes only, when several grapheme-based models are combined. Furthermore, combining grapheme and phoneme models yields a state-of-the-art error rate of 15.9% for the MGB 2018 dev17b test. For all four languages we also show that the language models perform reasonably well when only limited training data is available.
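To illustrate why grapheme-based acoustic models need no pronunciation dictionary, the following minimal sketch builds a lexicon in which each subword unit is "pronounced" as its own character sequence. The function name `grapheme_lexicon` and the `+` boundary marker are hypothetical choices made for this illustration; the abstract does not specify the marking scheme or the lexicon format actually used.

```python
def grapheme_lexicon(units, boundary="+"):
    """Map each subword unit to its grapheme sequence.

    Assumption: `boundary` is a hypothetical marker showing that a unit
    continues into (or from) a neighbouring unit. It carries no sound,
    so it is dropped from the grapheme sequence.
    """
    lexicon = {}
    for unit in units:
        graphemes = [ch for ch in unit if ch != boundary]
        if graphemes:  # skip units that consist only of markers
            lexicon[unit] = graphemes
    return lexicon


if __name__ == "__main__":
    # Toy subword vocabulary: a stem, a suffix, and a single character.
    for unit, graphemes in grapheme_lexicon(["talo+", "+ssa", "a"]).items():
        print(unit, " ".join(graphemes))
```

Because the mapping is derived from the spelling alone, the same procedure would apply to words, morph-like units, or single characters, which is what would allow one acoustic model to be paired with any of these language model units.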
