Polish tagger TaKIPI: rule based construction and optimization

A large number of different tags, limited corpora and the free word order are the main causes of low accuracy of tagging in Polish (automatic disambiguation of morphological descriptions) by applying commonly used techniques based on stochastic modelling. In the paper the rule-based architecture of the TaKIPI Polish tagger combining handwritten and automatically extracted rules is presented. The possibilities of optimisation of its parameters and component are discussed, including the possibility of using different methods of rules extraction, than C4.5 Decision Trees applied initially. The main goal of this paper is to explore a range of promising rule-based classifiers and investigate their impact on the accuracy of tagging. Simple techniques of combing classifiers are also tested. The performed experiments have shown that even a simple combination of different classifiers can increase the tagger’s accuracy by almost one percent.