Towards Kurdish Information Retrieval

Kurdish is an Indo-European language spoken in Kurdistan, a large geographical region in the Middle East. Despite its large number of speakers, Kurdish is among the less-resourced languages and has received little attention from the IR and NLP research communities. This article reports on the outcomes of a project aimed at providing essential resources for processing Kurdish texts. A principal output of this project is Pewan, the first standard test collection for evaluating Kurdish Information Retrieval systems. The other language resources that we have built include a lightweight stemmer and a list of stopwords. Our second principal contribution is the use of these newly built resources to conduct a thorough experimental study on Kurdish documents. Our experimental results show that normalization and, to a lesser extent, stemming can greatly improve the performance of Kurdish IR systems.
