Large vocabulary Uyghur continuous speech recognition based on stems and suffixes

In this paper, we study the vocabulary design problem in Uyghur large vocabulary continuous speech recognition (LVCSR). Uyghur is an agglutinative language in which words are formed by attaching one or more suffixes to a stem. As a result, the number of distinct word types in Uyghur is effectively unbounded. If whole words are used as the recognition units, the out-of-vocabulary (OOV) rate remains very high even with typical vocabulary sizes of 60k–100k. To avoid this problem, we split words into stems and suffixes and use these sub-words as the recognition units. Speech recognition experiments are performed on two test sets, one consisting of sentences from books and the other of sentences from conversations. Compared to the 80k-word baseline, the use of stems and suffixes reduces the OOV rate dramatically, and the best system reduces the word error rate (WER) from 46.5% to 44.5% on the book test set and from 57.6% to 47.5% on the conversation test set.
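The effect of the unit choice on the OOV rate can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the function names `split_word` and `oov_rate`, the greedy longest-stem segmentation, the `+` marker on suffix units, and the toy Latin-script stems and suffixes. It is not the segmentation method or the data used in the paper.

```python
def split_word(word, stems, suffixes):
    """Split a word into [stem, '+suffix', ...] using a greedy longest-stem match.

    Returns [word] unchanged when no full stem + suffix-chain analysis exists.
    """
    ordered = sorted(suffixes, key=len, reverse=True)  # try longer suffixes first
    for end in range(len(word), 0, -1):
        stem = word[:end]
        if stem not in stems:
            continue
        rest, units = word[end:], [stem]
        while rest:
            for suf in ordered:
                if rest.startswith(suf):
                    units.append("+" + suf)
                    rest = rest[len(suf):]
                    break
            else:
                break  # remaining characters match no known suffix
        if not rest:
            return units
    return [word]


def oov_rate(tokens, vocab):
    """Fraction of tokens that are not covered by the vocabulary."""
    return sum(t not in vocab for t in tokens) / len(tokens)


# Toy lexicon (hypothetical, for illustration only).
stems = {"mektep", "kitab"}
suffixes = {"lar", "ler", "da", "de", "ni"}

# Word-level vocabulary: only surface forms seen in "training".
word_vocab = {"mektep", "kitab", "mektepler"}
# Sub-word vocabulary: stems plus marked suffixes.
subword_vocab = stems | {"+" + s for s in suffixes}

test_words = ["mektep", "mekteplerde", "kitablarni", "mektepni"]
test_units = [u for w in test_words for u in split_word(w, stems, suffixes)]

word_oov = oov_rate(test_words, word_vocab)
subword_oov = oov_rate(test_units, subword_vocab)
print(word_oov, subword_oov)
```

In this toy setting, unseen inflected forms count as OOV at the word level but are fully covered once they are decomposed into a known stem and known suffixes. A real system would also need a rule for rejoining the recognized units into words, which is why the suffix units carry a marker here.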
