The field of Natural Language Processing (NLP) in the Philippines has been developing continually. However, the transition from Tagalog to the evolving Filipino language has left tools and resources behind. This paper introduces a Statistical Machine Translation Part-of-Speech (POS) Tagger for Filipino (SMTPOST), with the purpose of reviving, updating, and widening the scope of technologies in the POS tagging domain to accommodate the changes in the Filipino language. The resources built consist mainly of a tagset (218 tags), a parallel corpus (2,668 sentences), affix rules (59 rules), and a word-tag dictionary (309 entries). SMTPOST was tested on different tagsets and domains, producing 84.75% as its highest accuracy score, an increase of at least 3.75% over the available Tagalog POS taggers. Despite SMTPOST’s use of Filipino resources and its good performance, there is room for improvement. Recommendations include a better feature extractor (preferably a morphological analyzer), an increase in scope for all of the resources, the implementation of pre- and/or postprocessing, and the application of the SMTPOST research to other NLP applications.