Statistical Language Modeling for Automatic Speech Recognition of Agglutinative Languages

Automatic Speech Recognition (ASR) systems use statistical acoustic and language models to find the most probable word sequence given the speech signal. Hidden Markov Models (HMMs) serve as acoustic models, while language model probabilities are approximated with n-grams, in which the probability of a word is conditioned on the n-1 preceding words. The n-gram probabilities are estimated by Maximum Likelihood Estimation (MLE). A central problem in n-gram language modeling is data sparseness, which yields unreliable probability estimates, especially for rare and unseen n-grams. Smoothing is therefore applied to produce better estimates for these n-grams.

Traditional word-based n-gram language models are commonly used in state-of-the-art Large Vocabulary Continuous Speech Recognition (LVCSR) systems, and they achieve reasonable recognition performance for languages such as English and French. For instance, broadcast news (BN) in English can now be recognized with about a ten percent word error rate (WER) (NIST, 2000), which yields mostly quite understandable text. Some rare and new words may be missing from the vocabulary, but the result has proven sufficient for many important applications, such as browsing and retrieval of recorded speech and information retrieval from speech (Garofolo et al., 2000).

However, LVCSR attempts with similar systems for agglutinative languages, such as Finnish, Estonian, Hungarian and Turkish, have so far not achieved performance comparable to the English systems. The main reason for this performance deterioration is the rich morphological structure of these languages. In agglutinative languages, words are formed mainly by concatenating several suffixes to roots; together with compounding and inflection, this leads to millions of distinct, yet still frequent, word forms.
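The MLE and smoothing ideas above can be illustrated with a minimal bigram model. This is a sketch, not the estimator used in any of the cited systems: it uses simple add-k (Lidstone) smoothing as a stand-in for the more refined methods (e.g. Kneser-Ney) used in real ASR systems, and the toy corpus is invented for illustration.

```python
from collections import Counter

def bigram_mle(tokens, smoothing_k=0.0):
    """Estimate bigram probabilities P(w_i | w_{i-1}) from a token list.

    With smoothing_k=0 this is plain Maximum Likelihood Estimation;
    a positive k gives add-k (Lidstone) smoothing, so unseen bigrams
    receive a small non-zero probability instead of zero.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    vocab_size = len(unigrams)

    def prob(prev, word):
        num = bigrams[(prev, word)] + smoothing_k
        den = unigrams[prev] + smoothing_k * vocab_size
        return num / den if den > 0 else 0.0

    return prob

# toy corpus, purely for illustration
tokens = "the cat sat on the mat the cat ran".split()
p = bigram_mle(tokens)              # plain MLE
p_smooth = bigram_mle(tokens, 1.0)  # add-one smoothing

# MLE assigns zero probability to the unseen bigram ("cat", "mat");
# the smoothed model gives it a small positive estimate instead.
```

The zero probability returned by plain MLE for unseen bigrams is exactly the sparseness problem described above: any hypothesis containing an unseen n-gram would be ruled out entirely, which is why smoothing is indispensable in practice.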
Therefore, it is practically impossible to build a word-based vocabulary for speech recognition in agglutinative languages that would cover all the relevant words. If words are used as language modeling units, the limited vocabulary sizes of ASR systems lead to many out-of-vocabulary (OOV) words. It was shown that with an optimized 60K lexicon
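The OOV problem described above can be made concrete with a small sketch. The vocabulary and test tokens below are invented for illustration (Turkish inflections of the root "ev", house); the point is only that a fixed word list misses inflected forms it has never seen.

```python
def oov_rate(vocabulary, test_tokens):
    """Fraction of running test tokens not covered by the vocabulary."""
    vocab = set(vocabulary)
    misses = sum(1 for w in test_tokens if w not in vocab)
    return misses / len(test_tokens)

# toy illustration: inflected forms of the Turkish root "ev" (house)
train_vocab = ["ev", "evler", "evde"]            # house, houses, in the house
test = ["ev", "evlerde", "evlerimizde", "evde"]  # two unseen inflections
print(oov_rate(train_vocab, test))  # 0.5: half the test tokens are OOV
```

Every OOV token is guaranteed to be misrecognized, and each recognition error it causes tends to corrupt neighboring words as well, which is why sub-word (morph-based) units are attractive for these languages.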

[1]  Ebru Arisoy,et al.  Language modeling for automatic turkish broadcast news transcription , 2007, INTERSPEECH.

[2]  Mikko Kurimo,et al.  Vocabulary Decomposition for Estonian Open Vocabulary Speech Recognition , 2007, ACL.

[3]  Ellen M. Voorhees,et al.  The TREC Spoken Document Retrieval Track: A Success Story , 2000, TREC.

[4]  Erhan Mengusoglu,et al.  Turkish LVCSR: Database Preparation and Language Modeling for an Agglutinative Language , 2001 .

[5]  Andreas Stolcke,et al.  Entropy-based Pruning of Backoff Language Models , 2000, ArXiv.

[6]  Vesa Siivola,et al.  Growing an n-gram language model , 2005, INTERSPEECH.

[7]  Mathias Creutz,et al.  Unsupervised Morpheme Segmentation and Morphology Induction from Text Corpora Using Morfessor 1.0 , 2005 .

[8]  Ebru Arisoy,et al.  Morph-based speech recognition and modeling of out-of-vocabulary words across languages , 2007, TSLP.

[9]  Teemu Hirsimäki,et al.  On Growing and Pruning Kneser–Ney Smoothed $ N$-Gram Models , 2007, IEEE Transactions on Audio, Speech, and Language Processing.

[10]  Ebru Arisoy,et al.  A unified language model for large vocabulary continuous speech recognition of Turkish , 2006, Signal Process..

[11]  Ebru Arisoy,et al.  Analysis of Morph-Based Speech Recognition and the Modeling of Out-of-Vocabulary Words Across Languages , 2007, HLT-NAACL.

[12]  R. J. Lickley,et al.  Proceedings of the International Conference on Spoken Language Processing. , 1992 .

[13]  Mikko Kurimo,et al.  On lexicon creation for turkish LVCSR , 2003, INTERSPEECH.

[14]  李幼升,et al.  Ph , 1989 .

[15]  Ronald Rosenfeld,et al.  Optimizing lexical and N-gram coverage via judicious use of linguistic data , 1995, EUROSPEECH.

[16]  William J. Byrne,et al.  On large vocabulary continuous speech recognition of highly inflectional language - czech , 2001, INTERSPEECH.

[17]  Mathias Creutz,et al.  Unsupervised Discovery of Morphemes , 2002, SIGMORPHON.

[18]  Mikko Kurimo,et al.  Unlimited vocabulary speech recognition with morph language models applied to Finnish , 2006, Comput. Speech Lang..

[19]  Janne Pylkkönen New pruning criteria for efficient decoding , 2005, INTERSPEECH.

[20]  Petr Podveský,et al.  Speech Recognition of Czech-Inclusion of Rare Words Helps , 2005, ACL.

[21]  Murat Saraclar,et al.  Turkish Language Resources: Morphological Parser, Morphological Disambiguator and Web Corpus , 2008, GoTAL.

[22]  Ebru Arisoy,et al.  Lattice extension and rescoring based approaches for LVCSR of Turkish , 2006, INTERSPEECH.

[23]  Murat Saraclar,et al.  Morphological Disambiguation of Turkish Text with Perceptron Algorithm , 2009, CICLing.

[24]  Mikko Kurimo,et al.  Duration modeling techniques for continuous speech recognition , 2004, INTERSPEECH.

[25]  Andreas Stolcke,et al.  SRILM - an extensible language modeling toolkit , 2002, INTERSPEECH.

[26]  I. Lee Hetherington A characterization of the problem of new, out-of-vocabulary words in continuous-speech recognition and understanding , 1995 .

[27]  K. Oflazer,et al.  Incorporating language constraints in sub-word based speech recognition , 2005, IEEE Workshop on Automatic Speech Recognition and Understanding, 2005..

[28]  Oh-Wook Kwon,et al.  Korean large vocabulary continuous speech recognition with morpheme-based recognition units , 2003, Speech Commun..

[29]  T. Ciloglu,et al.  Investigation of Different Language Models for Turkish Speech Recognition , 2006, 2006 IEEE 14th Signal Processing and Communications Applications.