Some initial experiments with Indonesian probabilistic parsing

This paper presents initial experiments in constructing a probabilistic parser for Indonesian. Because no large manually parsed corpus is available, we start from an existing symbolic parser [4] and use it to parse a balanced collection of Indonesian sentences. A probabilistic context-free grammar (PCFG) language model is extracted from the resulting parses, ignoring the explicit linguistic information encoded in feature structures, and is subsequently used to parse an unseen collection of sentences. The resulting parse trees are evaluated against the set of candidate parses returned by the symbolic parser. The initial results indicate that the PCFG fails to capture verb subcategorization information accurately.
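
As an illustration of the extraction step, the following is a minimal sketch of relative-frequency (maximum-likelihood) estimation of rule probabilities from a set of parse trees. It is not the implementation used in these experiments: the tree encoding (nested tuples), the function names `rules` and `estimate_pcfg`, the category labels, and the two-sentence toy treebank are all assumptions made for the example.

```python
from collections import Counter

def rules(tree):
    """Yield (lhs, rhs) productions from a tree encoded as (label, child, ...).

    Children are either sub-trees (tuples) or words (plain strings).
    This encoding is an assumption made for illustration.
    """
    label, *children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    yield label, rhs
    for child in children:
        if not isinstance(child, str):
            yield from rules(child)

def estimate_pcfg(trees):
    """Relative-frequency estimate: P(lhs -> rhs) = count(lhs -> rhs) / count(lhs)."""
    counts = Counter(rule for tree in trees for rule in rules(tree))
    lhs_totals = Counter()
    for (lhs, _), n in counts.items():
        lhs_totals[lhs] += n
    return {rule: n / lhs_totals[rule[0]] for rule, n in counts.items()}

# Hypothetical toy treebank: "saya tidur" (I sleep), "saya makan nasi" (I eat rice).
treebank = [
    ("S", ("NP", "saya"), ("VP", ("V", "tidur"))),
    ("S", ("NP", "saya"), ("VP", ("V", "makan"), ("NP", "nasi"))),
]

pcfg = estimate_pcfg(treebank)
for (lhs, rhs), p in sorted(pcfg.items()):
    print(f"{lhs} -> {' '.join(rhs)}  [{p:.3f}]")
```

In this toy grammar the expansion of VP is chosen independently of the verb, so the intransitive verb "tidur" followed by an object NP still receives non-zero probability. This is one way a plain PCFG, stripped of feature-structure information, can lose subcategorization distinctions.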