TAGPRO: A SYSTEM FOR ITALIAN POS TAGGING BASED ON SVM

Part-of-speech (PoS) tagging is the problem of determining the correct part of speech of each word in a sequence. The most frequently applied approaches to this task are based on machine learning: Hidden Markov Models [1], Maximum Entropy taggers [5], Transformation-Based Learning, Memory-Based Learning [2], Decision Trees [3] and Support Vector Machines (SVMs) [4]. SVMs are among the most widely used techniques, and various implementations are available. As argued by T. Joachims [6], one of the advantages of SVMs is that dimensionality reduction is usually not needed, as they are robust to overfitting and scale well to high-dimensional feature spaces. We used YAMCHA, an SVM-based machine learning environment [8], to build TagPro, a PoS-tagging system that exploits a rich set of linguistic features, such as morphological analysis and proper name gazetteers. TagPro is part of TextPro, a suite of NLP tools developed at FBK-irst; the suite includes MorphoPro, the morphological analyzer whose output TagPro uses. TagPro was trained on the EVALITA development set, using both the standard EAGLES tagset and a new, structurally different tagset (DISTRIB). In the rest of the paper we give further details on SVMs, the feature space that we used, and the results we obtained.
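To make the idea of a rich feature set concrete, the following is a minimal sketch of how a feature dictionary for one token might be assembled before being passed to an SVM learner. It is an illustration under stated assumptions, not TagPro's actual feature template or YAMCHA's input format: the function `token_features`, the resources `TOY_MORPHOLOGY` and `TOY_GAZETTEER`, the feature names, and the tag labels are all hypothetical stand-ins for the morphological analysis and proper name gazetteers mentioned above.

```python
from typing import Dict, List, Set

# Hypothetical toy resources. The real system relies on MorphoPro and on
# proper name gazetteers; the entries and tag labels here are invented.
TOY_MORPHOLOGY: Dict[str, Set[str]] = {
    "la": {"DET", "PRON"},
    "casa": {"NOUN"},
    "di": {"PREP"},
}
TOY_GAZETTEER: Set[str] = {"Mario", "Roma"}


def token_features(tokens: List[str], i: int, window: int = 2) -> Dict[str, str]:
    """Build a sparse feature dictionary for tokens[i] from a context window."""
    feats: Dict[str, str] = {}

    # Lowercased word forms in a symmetric window, padded at sentence edges.
    for offset in range(-window, window + 1):
        j = i + offset
        in_range = 0 <= j < len(tokens)
        feats[f"w[{offset}]"] = tokens[j].lower() if in_range else "<PAD>"

    word = tokens[i]
    # Orthographic cues for the focus word.
    feats["suffix3"] = word[-3:].lower()
    feats["is_capitalized"] = str(word[:1].isupper())

    # Gazetteer lookup (assumed here to be case-sensitive).
    feats["in_gazetteer"] = str(word in TOY_GAZETTEER)

    # Candidate tags proposed by the toy morphological analysis.
    candidates = TOY_MORPHOLOGY.get(word.lower(), set())
    feats["morph"] = "|".join(sorted(candidates)) or "UNKNOWN"
    return feats


sentence = ["La", "casa", "di", "Mario"]
features_for_mario = token_features(sentence, 3)
```

Each key-value pair would typically be one-hot encoded, which yields the kind of high-dimensional sparse feature space that SVMs are said to handle without dimensionality reduction.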