Abstract: In this paper, several methods are combined to improve the accuracy of an HMM-based POS tagger for Bahasa Indonesia. The first method employs an affix tree that covers word suffixes and prefixes. The second uses the succeeding POS tag as one of the features for the HMM. The last uses an additional lexicon (from KBBI-Kateglo) to limit the candidate tags produced by the affix tree. The HMM model was built on a 15,000-token corpus. In the experiment, on a test corpus with 15% OOV words, the best accuracy was 96.50%, with 99.4% for in-vocabulary words and 80.4% for OOV (out-of-vocabulary) words. The experiment showed that the affix tree and the additional lexicon are effective in increasing POS tagger accuracy, while using the succeeding POS tag does not give much improvement in OOV handling.

Keywords: POS tagger, HMM method, affix tree, succeeding POS tag

I. INTRODUCTION

Part-of-Speech (POS) tagging is the process of assigning part-of-speech tags to words in a text [7, 20, 5]. A part-of-speech tag is a grammatical category such as verb, noun, adjective, or adverb. A part-of-speech tagger is an essential tool in many natural language processing applications such as word sense disambiguation, parsing, question answering, and machine translation [12, 2]. Manually assigning part-of-speech tags to words is expensive, laborious, and time-consuming, hence the widespread interest in automating the process. The main problems in designing an accurate automatic part-of-speech tagger are word ambiguity and Out-of-Vocabulary (OOV) words. Word ambiguity refers to a word behaving differently in different contexts. An OOV word is a word that does not occur in the annotated corpus.

There are several approaches to POS tagging, namely rule-based, probabilistic, and transformation-based approaches. A rule-based POS tagger assigns a POS tag to a word based on several manually created linguistic rules [8].
The probabilistic approach determines the most probable tag of a token given its surrounding context, based on probability values obtained from a manually tagged corpus [7]. The transformation-based approach combines the rule-based and probabilistic approaches to automatically derive symbolic rules from a corpus [3].

Bahasa Indonesia is the national language of Indonesia, spoken by more than 222 million people [11]. It is widely used in Indonesia for communication in schools, government offices, daily life, and so on. As the formal language of the country, it unites citizens who speak different languages and bridges the language barrier among Indonesians with different mother tongues. Even so, the availability of language tools and resources for research on Bahasa Indonesia is still limited. One language tool that is not yet commonly available for Bahasa Indonesia is a POS tagger.

There has been relatively little work on POS tagging for Bahasa Indonesia. Pisceldo et al. [7] developed a POS tagger for Bahasa Indonesia using a Maximum Entropy model and Conditional Random Fields (CRF); the best accuracy they report is 97.57%. Triastuti [3] also developed a POS tagger for Bahasa Indonesia using CRF, a transformation-based approach, and a combination of the two; the best accuracy in [3] was 90.08%. Sari et al. [24] applied Brill's transformational rule-based approach to a POS tagger for Bahasa Indonesia with a limited tagset, trained on a small manually annotated corpus. Using contextual features, among others, they showed that the method obtained an accuracy of 88%. However, there has been no in-depth study of building a POS tagger for Bahasa Indonesia using a Hidden Markov Model (HMM). The Hidden Markov Model is an established probabilistic method for automatic POS tagging.
Several languages have adopted the HMM method for building automatic POS taggers [16, 17, 6, 1]. A POS tagger using the HMM method has been shown to have a better running time than other probabilistic methods [14]. In this study, we report our attempt at developing an HMM-based part-of-speech tagger for Bahasa Indonesia.
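
As a rough illustration of the probabilistic approach described above, the following sketch shows a first-order (bigram) HMM tagger decoded with the Viterbi algorithm. It is a generic textbook formulation, not the tagger developed in this paper. The toy corpus, the three-tag tagset (PR, VB, NN), the add-one smoothing, and all function names are assumptions made for the example. In particular, add-one smoothing is only a stand-in for unknown-word handling and is not the affix tree method.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy corpus (not the paper's data). PR = pronoun, VB = verb, NN = noun.
TRAIN = [
    [("saya", "PR"), ("makan", "VB"), ("nasi", "NN")],
    [("dia", "PR"), ("minum", "VB"), ("air", "NN")],
    [("saya", "PR"), ("minum", "VB"), ("air", "NN")],
]

def train(sentences):
    """Count tag-to-tag transitions and tag-to-word emissions."""
    trans = defaultdict(Counter)  # trans[previous_tag][tag]
    emit = defaultdict(Counter)   # emit[tag][word]
    for sentence in sentences:
        prev = "<s>"  # sentence-start pseudo tag
        for word, tag in sentence:
            trans[prev][tag] += 1
            emit[tag][word] += 1
            prev = tag
    return trans, emit

def log_prob(counter, key, n_outcomes):
    """Add-one smoothed log probability (an assumption of this sketch)."""
    return math.log((counter[key] + 1) / (sum(counter.values()) + n_outcomes))

def viterbi(words, trans, emit):
    """Return the most probable tag sequence for a non-empty word list."""
    tags = sorted(emit)
    # Vocabulary size plus one slot reserved for unseen (OOV) words.
    vocab_size = len({w for t in emit for w in emit[t]}) + 1
    # best[i][tag] = (log score of best path ending in tag, previous tag)
    best = [{}]
    for t in tags:
        score = log_prob(trans["<s>"], t, len(tags))
        score += log_prob(emit[t], words[0], vocab_size)
        best[0][t] = (score, None)
    for i in range(1, len(words)):
        best.append({})
        for t in tags:
            score, prev = max(
                (best[i - 1][p][0] + log_prob(trans[p], t, len(tags)), p)
                for p in tags
            )
            best[i][t] = (score + log_prob(emit[t], words[i], vocab_size), prev)
    # Follow the back-pointers from the best final tag.
    last = max(tags, key=lambda t: best[-1][t][0])
    path = [last]
    for i in range(len(words) - 1, 0, -1):
        last = best[i][last][1]
        path.append(last)
    return path[::-1]

TRANS, EMIT = train(TRAIN)
print(viterbi(["dia", "makan", "roti"], TRANS, EMIT))
```

In this toy setting, "roti" never occurs in the training data, so its smoothed emission probability is the same for every tag and the transition probabilities alone decide its tag. This is the kind of OOV case where extra evidence about the word form could help.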

REFERENCES

[1] Helmut Schmid et al. Improvements in Part-of-Speech Tagging with an Application to German. 1999.
[2] Mary P. Harper et al. A Second-Order Hidden Markov Model for Part-of-Speech Tagging. ACL, 1999.
[3] Saeid Rahati Quchani et al. Persian part of speech tagger based on Hidden Markov Model. 2008.
[4] Yuji Matsumoto et al. Unknown Word Guessing and Part-of-Speech Tagging Using Support Vector Machines. NLPRS, 2001.
[5] Penelope Sibun et al. A Practical Part-of-Speech Tagger. ANLP, 1992.
[6] Thorsten Brants et al. TnT – A Statistical Part-of-Speech Tagger. ANLP, 2000.
[7] Yunsong Guo et al. Comparisons of sequence labeling algorithms and extensions. ICML '07, 2007.