Towards Kurdish Information Retrieval

Kurdish is an Indo-European language spoken in Kurdistan, a large geographical region in the Middle East. Despite its large number of speakers, Kurdish is among the less-resourced languages and has received little attention from the IR and NLP research communities. This article reports on the outcomes of a project aimed at providing essential resources for processing Kurdish texts. A principal output of this project is Pewan, the first standard test collection for evaluating Kurdish information retrieval systems. The other language resources we have built include a lightweight stemmer and a list of stopwords. Our second principal contribution is a thorough experimental study on Kurdish documents using these newly built resources. Our experimental results show that normalization and, to a lesser extent, stemming can greatly improve the performance of Kurdish IR systems.
