Character-Based N-gram Model for Uyghur Text Retrieval

Uyghur is a low-resource language, yet Uyghur Information Retrieval (IR) is becoming increasingly important. Although related research and stem-based Uyghur IR systems exist, high retrieval performance remains difficult to achieve because of the limitations of existing stemming methods. In this paper, we propose a character-based N-gram model and a corresponding smoothing algorithm for Uyghur IR. A full-text IR system based on the character N-gram model is developed using the open-source tool Lucene. A series of experiments and comparative analyses is conducted. Experimental results show that the proposed method outperforms conventional Uyghur IR systems.
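To make the idea of a smoothed character N-gram model concrete, the following is a minimal sketch and not the system described in the paper. The abstract does not specify the smoothing algorithm, so standard Jelinek-Mercer interpolation is used here as a stand-in; the function names `char_ngrams` and `score`, the parameters `n` and `lam`, and the Latin-script toy strings are all assumptions made for illustration.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Split text on whitespace and return overlapping character n-grams
    per token. Tokens shorter than n are kept whole (an assumption)."""
    grams = []
    for token in text.split():
        if len(token) < n:
            grams.append(token)
        else:
            grams.extend(token[i:i + n] for i in range(len(token) - n + 1))
    return grams


def score(query, doc, collection, n=3, lam=0.5):
    """Log query likelihood of `query` under a character n-gram model of
    `doc`, interpolated with the collection model (Jelinek-Mercer
    smoothing, used here as a stand-in for the paper's algorithm)."""
    doc_counts = Counter(char_ngrams(doc, n))
    coll_counts = Counter(g for d in collection for g in char_ngrams(d, n))
    doc_len = sum(doc_counts.values())
    coll_len = sum(coll_counts.values())

    log_p = 0.0
    for gram in char_ngrams(query, n):
        p_doc = doc_counts[gram] / doc_len if doc_len else 0.0
        p_coll = coll_counts[gram] / coll_len if coll_len else 0.0
        p = lam * p_doc + (1 - lam) * p_coll
        if p == 0.0:
            # N-gram unseen in the whole collection: skip it (an assumption).
            continue
        log_p += math.log(p)
    return log_p


# Toy collection: the query shares n-grams with an inflected form in docs[0].
docs = ["kitablar oqush", "alma yeyish"]
ranked = sorted(docs, key=lambda d: score("kitab", d, docs), reverse=True)
```

Because matching happens on character N-grams rather than on stems, the query can match a suffixed form of the same word without any stemmer, which is the motivation the abstract gives for moving away from stem-based indexing.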
