Enhanced Rules Application Order to Stem Affixation, Reduplication and Compounding Words in Malay Texts

A word stemmer is an automated program that removes affixes, clitics and particles from derived words based on the morphological structure of a specific natural language. Stemmers are widely used for text preprocessing in many artificial intelligence applications, and how accurately a stemmer handles derived words influences the performance of information retrieval, text mining and text categorization applications. Although various stemming approaches have been proposed in past research, existing word stemmers for the Malay language still suffer from stemming errors. Moreover, existing stemmers consider the morphological structure of Malay only partially: they focus on affixation words rather than handling affixation, reduplication and compounding words together. This paper therefore proposes an enhanced word stemmer, based on rule-based affix removal and dictionary lookup and called enhanced rules application order, that stems affixation, reduplication and compounding words while also addressing possible stemming errors. The paper also examines possible root causes of affixation, reduplication and compounding stemming errors that can occur during the stemming process. The experimental results indicate that the proposed word stemmer stems affixation, reduplication and compounding words with better stemming accuracy by using the enhanced rules application order.
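To make the combination of dictionary lookup and rule-based affix removal concrete, the following is a minimal sketch, not the paper's actual algorithm or rule order. The tiny root dictionary `ROOTS`, the affix lists, and the function names `strip_affixes` and `stem` are all assumptions made for illustration. The sketch checks the dictionary first, then handles hyphenated reduplication, then tries affix removal; compound words are simply assumed to be stored whole in the dictionary.

```python
# Hypothetical toy root dictionary; a real stemmer would load a full word list.
ROOTS = {"makan", "main", "buku", "ajar", "kerjasama"}

# Hypothetical, deliberately incomplete affix lists for illustration only.
PREFIXES = ("ber", "me", "di", "ter", "pe")
SUFFIXES = ("kan", "an", "i")


def strip_affixes(word):
    """Return a dictionary root reachable by removing one prefix and/or one suffix."""
    if word in ROOTS:
        return word
    candidates = [word]
    # Candidates with one prefix removed.
    for prefix in PREFIXES:
        if word.startswith(prefix):
            candidates.append(word[len(prefix):])
    # For every candidate so far, also try removing one suffix.
    for candidate in list(candidates):
        for suffix in SUFFIXES:
            if candidate.endswith(suffix):
                candidates.append(candidate[:-len(suffix)])
    # Accept only candidates confirmed by the dictionary.
    for candidate in candidates:
        if candidate in ROOTS:
            return candidate
    return word  # no rule produced a known root; leave the word unchanged


def stem(word):
    """Apply the steps in a fixed order: dictionary, reduplication, affixation."""
    word = word.lower()
    # 1. Dictionary lookup first, so roots and stored compounds are never cut.
    if word in ROOTS:
        return word
    # 2. Hyphenated reduplication: stem each part and accept if all parts agree.
    if "-" in word:
        stems = {strip_affixes(part) for part in word.split("-")}
        if len(stems) == 1:
            root = stems.pop()
            if root in ROOTS:
                return root
    # 3. Plain affixation.
    return strip_affixes(word)


for example in ["makanan", "bermain", "buku-buku", "bermain-main", "makan"]:
    print(example, "->", stem(example))
```

Checking the dictionary before applying any rule is what keeps a root such as "makan" from being cut at its final "an", which is one kind of stemming error that the order of rule application is meant to prevent.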
