Lexicon Size and Language Model Order Optimization for Russian LVCSR

This paper compares 2-, 3-, and 4-gram language models built with lexicons of various sizes. The training corpus was collected from recent Internet news sites and contains about 350 million words (2.4 GB of data). The language models were built with recognition lexicons of 110K, 150K, 219K, and 303K words. To evaluate these models, perplexity, out-of-vocabulary (OOV) word rate, and n-gram hit rate were computed. Experimental results on continuous Russian speech recognition are also reported.
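
The three evaluation measures named above can be illustrated with a minimal sketch. It is not the toolkit or procedure used in the paper: the function names, the toy sentences, and the exact definition of n-gram hit rate (the share of test n-grams of order n that also occur in the training text) are assumptions made for illustration only.

```python
import math


def oov_rate(test_tokens, lexicon):
    """Fraction of test tokens that are absent from the recognition lexicon."""
    if not test_tokens:
        return 0.0
    missing = sum(1 for word in test_tokens if word not in lexicon)
    return missing / len(test_tokens)


def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_hit_rate(train_tokens, test_tokens, n):
    """Share of test n-grams that were also seen in training (assumed definition)."""
    seen = set(ngrams(train_tokens, n))
    candidates = ngrams(test_tokens, n)
    if not candidates:
        return 0.0
    return sum(1 for gram in candidates if gram in seen) / len(candidates)


def perplexity(word_probs):
    """Perplexity from per-word model probabilities: 2 ** (-mean(log2 p))."""
    mean_log2 = sum(math.log2(p) for p in word_probs) / len(word_probs)
    return 2 ** (-mean_log2)


# Toy data (hypothetical, not from the paper's corpus).
train = "the cat sat on the mat".split()
test = "the cat sat on the rug".split()
lexicon = set(train)

print(oov_rate(test, lexicon))          # one of six test words is unknown
print(ngram_hit_rate(train, test, 2))   # four of five test bigrams were seen
print(perplexity([0.25, 0.25, 0.25, 0.25]))
```

A larger lexicon typically lowers the OOV rate, while a higher model order typically lowers the hit rate because longer n-grams are sparser in the training text; the sketch only makes these quantities concrete and does not reproduce the paper's figures.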