Kurdish stemmer pre-processing steps for improving information retrieval

The rapid increase in the number of Kurdish documents over the last several years has created a need for better accuracy and precision in text classification and retrieval. Stemming is an essential pre-processing step that increases the chance of matching terms in a document during text classification tasks, and it reduces the total number of searchable terms within a document or query. This article proposes an active approach for stemming Kurdish Sorani texts that reduces the variants of a word to a single term, or stem. The results described in this article show that stemming decreases the dimensionality of document feature vectors and thereby increases retrieval effectiveness. The process applied to Kurdish Sorani can also be adapted to Kurdish Kurmanji for greater efficiency and effectiveness in digital text classification and related applications.
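To make the idea of collapsing word variants concrete, the following is a minimal sketch of a longest-match suffix stripper. It is not the rule set proposed in the article: the suffix list, the minimum stem length, and the Latin-script transliterations of Sorani forms of "kitêb" (book) are assumptions chosen only for illustration.

```python
# Illustrative only: a tiny, hypothetical suffix list in Latin transliteration,
# ordered longest first so that "-ekan" is tried before "-an".
SUFFIXES = ("ekan", "eke", "êk", "an")

# Assumed guard against over-stemming very short words.
MIN_STEM_LEN = 3


def stem(word: str) -> str:
    """Strip the first (longest) matching suffix, if the remainder is long enough."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LEN:
            return word[: -len(suffix)]
    return word


def vocabulary(tokens, use_stemming: bool) -> set:
    """Return the set of distinct index terms, with or without stemming."""
    return {stem(t) if use_stemming else t for t in tokens}


# Five surface forms of the same noun (transliterated, for illustration).
tokens = ["kitêb", "kitêbeke", "kitêbekan", "kitêbêk", "kitêban"]

print(len(vocabulary(tokens, use_stemming=False)))  # distinct terms without stemming
print(len(vocabulary(tokens, use_stemming=True)))   # distinct terms after stemming
```

In this toy vocabulary the five surface forms collapse to one index term, which is the sense in which stemming lowers the dimensionality of a document's feature vector. A real Sorani stemmer would work on Arabic-script text and need a far richer set of prefix and suffix rules.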
