A Ruled-Based Part of Speech (RPOS) Tagger for Malay Text Articles

The Malay language is an Austronesian language spoken in most countries in the South East Asia region that includes Malaysia, Indonesia, Singapore, Brunei and Thailand. Traditional linguistics is well developed for Malay but there are very limited resources and tools that are available or made accessible for computer linguistic analysis of Malay language. Assigning part of speech (POS) to running words in a sentence for Malay language is one of the pipeline processes in Natural Language Processing (NLP) tasks and it is not well investigated. This paper outlines an approach to perform the Part of Speech (POS) tagging for Malay text articles. We apply a simple Rule-based Part of Speech (RPOS) tagger to perform the tagging operation on Malay text articles. POS tagging can be described as a task of performing automatic annotation of syntactic categories for each word in a text document. A rule-based POS tagger generally involves a POS tag dictionary and a set of rules in order to identify the words that are considered parts of speech. In this paper, we propose a framework that applies Malay affixing rules to identify the Malay POS tag and the relation between words in order to select the best POS tag for words that have two or more valid POS tags. The results show that the performance accuracy of the ruled-based POS tagger is higher compared to a statistical POS tagger. This indicates that the proposed RPOS tagger is able to predict any unknown word's POS at some promising accuracy.

[1]  James H. Martin,et al.  Speech and language processing: an introduction to natural language processing, computational linguistics, and speech recognition, 2nd Edition , 2000, Prentice Hall series in artificial intelligence.

[2]  Bernard Mérialdo,et al.  Tagging English Text with a Probabilistic Model , 1994, CL.

[3]  Eric Brill,et al.  A Simple Rule-Based Part of Speech Tagger , 1992, HLT.

[4]  Claudio Giuliano,et al.  Unsupervised Part of Speech Tagging Supporting Supervised Methods , 2007 .

[5]  David Elworthy,et al.  Does Baum-Welch Re-estimation Help Taggers? , 1994, ANLP.

[6]  Beatrice Santorini,et al.  Part-of-Speech Tagging Guidelines for the Penn Treebank Project (3rd Revision) , 1990 .

[7]  James H. Martin,et al.  Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition , 2000 .

[8]  Q.I. Wang,et al.  Improved estimation for unsupervised part-of-speech tagging , 2005, 2005 International Conference on Natural Language Processing and Knowledge Engineering.

[9]  Ayu Purwarianti,et al.  Developing cross language systems for language pair with limited resource : Indonesian-Japanese CLIR and CLQA , 2007 .

[10]  Virginia Teller Review of Speech and language processing: an introduction to natural language processing, computational linguistics, and speech recognition by Daniel Jurafsky and James H. Martin. Prentice Hall 2000. , 2000 .

[11]  Michele Banko,et al.  Part-of-Speech Tagging in Context , 2004, COLING.