An improved Urdu stemming algorithm for text mining based on multi-step hybrid approach

ABSTRACT Stemming is a basic operation in natural language processing (NLP) that removes derivational and inflectional affixes from a word, without performing a full morphological analysis, in order to extract its root or stem. In NLP, stemmers are used to improve information retrieval (IR), text classification (TC), text mining (TM), and related applications. Existing Urdu stemmers operate only on unigrams from the input text and ignore bigrams, trigrams, and longer n-grams. To improve the effectiveness and efficiency of stemming, bigrams and trigrams must also be handled. Despite this, only a few Urdu stemming methods have been developed in past studies. In this paper, we therefore propose an improved Urdu stemmer that uses a hybrid approach, divided into multiple steps, to handle unigram, bigram, and trigram features. To evaluate the proposed method, we use two corpora, a word corpus and a text corpus, and apply two evaluation metrics to measure its performance. The proposed algorithm achieves an accuracy of 92.97% and a compression rate of 55%. These experimental results indicate that the proposed system can improve the effectiveness and efficiency of Urdu stemming for information retrieval and text mining applications.
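
The abstract reports an accuracy and a compression rate but does not give their formulas. As a minimal sketch, assuming accuracy is the fraction of words whose produced stem matches a reference stem, and compression rate is the index compression factor (N - S) / N over N distinct words and S distinct stems, the two metrics could be computed as follows. The function names and the toy English word list are hypothetical and serve only as illustration; they are not the paper's data or implementation.

```python
def stemming_accuracy(predicted, gold):
    """Fraction of words whose predicted stem equals the reference stem.

    Assumed definition; the abstract does not state the formula.
    """
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        return 0.0
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)


def compression_rate(words, stems):
    """Index compression factor (N - S) / N.

    N is the number of distinct input words and S the number of distinct
    stems. Assumed definition; the abstract does not state the formula.
    """
    n = len(set(words))
    s = len(set(stems))
    if n == 0:
        return 0.0
    return (n - s) / n


# Toy data (hypothetical): five words, reference stems, and stemmer output.
words = ["books", "booking", "booked", "reader", "reading"]
gold = ["book", "book", "book", "read", "read"]
predicted = ["book", "book", "book", "reader", "read"]

print(stemming_accuracy(predicted, gold))  # 4 of 5 stems match the reference
print(compression_rate(words, predicted))  # 5 distinct words map to 3 distinct stems
```

Under these assumed definitions, a higher compression rate means more word forms are conflated to a shared stem, which is only desirable when accuracy stays high as well.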
