Improved Subword Modeling for WFST-Based Speech Recognition

Because agglutinative languages have a very large number of observed word forms, subword units are often used in speech recognition. However, using subword units properly requires careful attention to details such as silence modeling, position-dependent phones, and the combination of units. In this paper, we implement subword modeling in the Kaldi toolkit by creating modified lexicon finite-state transducers that represent the subword units correctly. We experiment with multiple types of word boundary markers and achieve the best results by adding a marker to the left or right side of a subword unit whenever it is not preceded or followed by a word boundary, respectively. We also compare three toolkits that provide data-driven subword segmentations. In experiments on a variety of Finnish and Estonian datasets, the best subword models outperform both word-based models and naive subword implementations. The largest relative reduction in word error rate (WER) is 23% over a word-based model on a Finnish read speech dataset. The results are also better than any previously published for the same datasets, and the improvement exceeds 5% on every dataset.
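The best-performing marking scheme described above can be sketched as a small function: a marker is attached to the left of a subword when it does not start a word, and to the right when it does not end a word. This is a minimal illustration only; the choice of "+" as the marker symbol and the function name are assumptions, not taken from the paper.

```python
def mark_subwords(segmented_word, marker="+"):
    """Attach a boundary marker to each subword of one word:
    on the left if the subword is not word-initial, and on the
    right if it is not word-final.

    Hypothetical helper illustrating the marking scheme from the
    abstract; the "+" marker symbol is an assumption.
    """
    n = len(segmented_word)
    marked = []
    for i, sub in enumerate(segmented_word):
        left = marker if i > 0 else ""        # not preceded by a word boundary
        right = marker if i < n - 1 else ""   # not followed by a word boundary
        marked.append(left + sub + right)
    return marked

# Example: Finnish "puheentunnistus" (speech recognition) split into subwords
print(mark_subwords(["puhe", "en", "tunnistus"]))
# -> ['puhe+', '+en+', '+tunnistus']
```

A whole word left unsegmented receives no markers, so word-based and subword lexicon entries can coexist in the same transducer.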
