Paper information - Aksara: An Indonesian Morphological Analyzer that Conforms to the UD v2 Annotation Guidelines


The objective of this work is to build Aksara, an Indonesian morphological analyzer that conforms to the Universal Dependencies (UD) annotation guidelines, specifically UD v2. Several earlier works developed Indonesian morphological analyzers, but as far as we know none conforms to the UD annotation guidelines. In building Aksara we use the same approach as MorphInd, another Indonesian morphological analyzer, which uses a finite-state compiler named Foma. Aksara performs four tasks: 1) word segmentation, 2) lemmatization, 3) POS tagging, and 4) morphological feature analysis. To evaluate the quality of the tool, we used an Indonesian dependency treebank that conforms to UD v2 as the gold standard. We also compare the performance of Aksara with that of MorphInd by mapping MorphInd output to the CoNLL-U format. The experimental results show that Aksara outperforms MorphInd on all four tasks. For word segmentation, Aksara achieves an accuracy of 96.9%; for case-sensitive lemmatization, an accuracy of 94.83%; for POS tagging, an F1-score of 88.2%; and for morphological feature analysis, nine of the 18 feature-value tags implemented so far have an F1-score above 80%.
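The evaluation described above (scoring analyzer output against a gold-standard treebank in CoNLL-U format) can be illustrated with a minimal sketch. This is not the authors' evaluation script: the function names `parse_conllu`, `accuracy`, and `per_tag_f1` are hypothetical, the two tiny CoNLL-U samples are made up for illustration, and the sketch assumes that gold and predicted tokens are already aligned one-to-one. The abstract does not say how the POS F1-score is averaged, so only per-tag scores are computed here.

```python
from collections import Counter

def parse_conllu(text):
    """Return a list of (form, lemma, upos) tuples from CoNLL-U text."""
    tokens = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        cols = line.split("\t")
        # Skip multiword-token ranges (e.g. "1-2") and empty nodes (e.g. "1.1").
        if "-" in cols[0] or "." in cols[0]:
            continue
        tokens.append((cols[1], cols[2], cols[3]))  # FORM, LEMMA, UPOS
    return tokens

def accuracy(gold, pred, field):
    """Fraction of aligned tokens whose value at `field` matches exactly."""
    correct = sum(g[field] == p[field] for g, p in zip(gold, pred))
    return correct / len(gold)

def per_tag_f1(gold, pred, field=2):
    """Per-tag F1 over aligned tokens (field 2 is UPOS)."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g[field] == p[field]:
            tp[g[field]] += 1
        else:
            fp[p[field]] += 1  # predicted tag was wrong
            fn[g[field]] += 1  # gold tag was missed
    scores = {}
    for tag in set(tp) | set(fp) | set(fn):
        denom = 2 * tp[tag] + fp[tag] + fn[tag]
        scores[tag] = 2 * tp[tag] / denom if denom else 0.0
    return scores

# Made-up toy data (assumption): one sentence, 10 tab-separated CoNLL-U columns.
GOLD = "\n".join([
    "# text = Dia membaca buku.",
    "1\tDia\tdia\tPRON\t_\t_\t2\tnsubj\t_\t_",
    "2\tmembaca\tbaca\tVERB\t_\t_\t0\troot\t_\t_",
    "3\tbuku\tbuku\tNOUN\t_\t_\t2\tobj\t_\t_",
    "4\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_",
])
PRED = "\n".join([
    "# text = Dia membaca buku.",
    "1\tDia\tDia\tPRON\t_\t_\t_\t_\t_\t_",      # lemma differs in case only
    "2\tmembaca\tbaca\tVERB\t_\t_\t_\t_\t_\t_",
    "3\tbuku\tbuku\tVERB\t_\t_\t_\t_\t_\t_",    # wrong POS tag
    "4\t.\t.\tPUNCT\t_\t_\t_\t_\t_\t_",
])

gold = parse_conllu(GOLD)
pred = parse_conllu(PRED)
lemma_acc = accuracy(gold, pred, field=1)  # case-sensitive lemma comparison
upos_f1 = per_tag_f1(gold, pred)
print(f"lemma accuracy: {lemma_acc:.2f}")
for tag in sorted(upos_f1):
    print(f"{tag}: F1 = {upos_f1[tag]:.2f}")
```

Comparing lemmas with plain string equality makes the score case-sensitive, which is why the `Dia`/`dia` mismatch counts as an error in this toy example; lowercasing both sides before comparison would give a case-insensitive variant.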

Ika Alfina | Muhammad Yudistira Hanifmuti
