The Effectiveness of a Jawi Stemmer for Retrieving Relevant Malay Documents in Jawi Characters

The Malay language has two writing scripts, Rumi and Jawi. Most previous stemmer evaluations have reported results for Malay in Rumi characters, and only a few have tested Jawi characters. In this article, a new Jawi stemmer is proposed and tested for document retrieval. A total of 36 queries and datasets from the transliterated Jawi Quran were used. The experiment shows that the mean average precision for “stemmed Jawi” documents is 8.43%, whereas the mean average precision for “nonstemmed Jawi” documents is 5.14%. A paired sample t-test showed that using “stemmed Jawi” documents increased precision in document retrieval. Further experiments examined the precision of the relevant documents retrieved at various cutoff points for all 36 queries. Starting at a cutoff of 40, the results for the “stemmed Jawi” documents differed significantly from those for the “nonstemmed Jawi” documents. This result shows the usefulness of a Jawi stemmer for retrieving relevant documents in the Jawi script.
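The evaluation measures named above (mean average precision, precision at a cutoff, and the paired sample t-test) can be illustrated with a minimal Python sketch. This is not the authors' evaluation code. The function names, the document identifiers, and the toy inputs in the usage note are all hypothetical, and the sketch assumes the standard textbook definitions of each measure.

```python
import math
import statistics
from typing import Sequence, Set, Tuple


def precision_at_k(ranking: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k retrieved documents that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(1 for doc in ranking[:k] if doc in relevant) / k


def average_precision(ranking: Sequence[str], relevant: Set[str]) -> float:
    """Sum of the precision values at each rank where a relevant document
    appears, divided by the total number of relevant documents."""
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


def mean_average_precision(runs: Sequence[Tuple[Sequence[str], Set[str]]]) -> float:
    """Mean of average precision over all (ranking, relevant set) pairs."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


def paired_t_statistic(a: Sequence[float], b: Sequence[float]) -> float:
    """t statistic for paired samples: mean difference divided by its
    standard error (sample standard deviation / sqrt(n))."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equal-length samples with n >= 2")
    diffs = [x - y for x, y in zip(a, b)]
    return statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(len(diffs)))
```

In a setup like the one described, each of the 36 queries would contribute one average precision score per condition (stemmed and nonstemmed), and the two score lists would be passed to `paired_t_statistic`. For a toy ranking `["d1", "d2", "d3", "d4"]` with relevant set `{"d1", "d3"}`, `precision_at_k(..., k=2)` is 0.5 and `average_precision` is (1/1 + 2/3) / 2.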
