Part-of-speech (PoS) tagging is the problem of determining the correct part of speech of each word in a sequence. The most common approaches to this task are based on machine learning: Hidden Markov Models [1], Maximum Entropy taggers [5], Transformation-Based Learning, Memory-Based Learning [2], Decision Trees [3] and Support Vector Machines (SVMs) [4]. SVMs are among the most widely used techniques, and various implementations are available. As argued by T. Joachims [6], one of the advantages of SVMs is that dimensionality reduction is usually not needed, since they are robust to overfitting and scale well to high-dimensional feature spaces. We used YAMCHA, an SVM-based machine learning environment [8], to build TagPro, a PoS-tagging system that exploits a rich set of linguistic features, such as morphological analysis and proper name gazetteers. TagPro is part of TextPro, a suite of NLP tools developed at FBK-irst; the suite also includes MorphoPro, the morphological analyzer whose output TagPro uses. TagPro was trained on the EVALITA development set with two tagsets: the standard EAGLES tagset and a new, structurally different one (DISTRIB). In the rest of the paper we give further details on SVMs, the feature space we used, and the results we obtained.
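To make the notion of a feature space concrete, the following minimal Python sketch shows how one token could be mapped to a sparse set of features of the kinds mentioned above (context words, a gazetteer flag, morphological candidates). It is an illustration only, not TagPro's actual feature set: the function name `token_features`, the feature names, the window size, the toy gazetteer, and the dictionary standing in for a morphological analyzer are all assumptions made for this example.

```python
from typing import Dict, List, Set


def token_features(tokens: List[str], i: int, gazetteer: Set[str],
                   morpho: Dict[str, List[str]], window: int = 2) -> Dict[str, str]:
    """Build a sparse feature dict for tokens[i] (hypothetical feature names)."""
    feats: Dict[str, str] = {}
    # Lowercased word forms in a symmetric context window, padded at sentence edges.
    for off in range(-window, window + 1):
        j = i + off
        feats[f"w[{off}]"] = tokens[j].lower() if 0 <= j < len(tokens) else "<PAD>"
    word = tokens[i]
    # Simple orthographic cues.
    feats["suffix3"] = word[-3:].lower()
    feats["is_capitalized"] = str(word[:1].isupper())
    # Proper name gazetteer lookup (toy set, assumed for illustration).
    feats["in_gazetteer"] = str(word in gazetteer)
    # Candidate tags from a morphological analyzer; a plain dict stands in here.
    feats["morpho"] = "|".join(sorted(morpho.get(word.lower(), ["UNK"])))
    return feats


# Toy data, assumed for illustration; the tags are not the EAGLES or DISTRIB tagsets.
tokens = ["Maria", "legge", "un", "libro"]
gazetteer = {"Maria"}
morpho = {"legge": ["NOUN", "VERB"], "un": ["DET"], "libro": ["NOUN"]}
feats = token_features(tokens, 0, gazetteer, morpho)
```

Each key/value pair would typically be turned into one binary dimension before being passed to an SVM, which is why the resulting feature space is high-dimensional and sparse.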
[1] Thorsten Joachims, et al. Text Categorization with Support Vector Machines: Learning with Many Relevant Features. ECML, 1998.
[2] Adwait Ratnaparkhi, et al. A Maximum Entropy Model for Part-Of-Speech Tagging. EMNLP, 1996.
[3] Walter Daelemans, et al. MBT: A Memory-Based Part of Speech Tagger-Generator. VLC@COLING, 1996.
[4] Yuji Matsumoto, et al. Unknown Word Guessing and Part-of-Speech Tagging Using Support Vector Machines. NLPRS, 2001.
[5] Horacio Rodríguez, et al. Part-of-Speech Tagging Using Decision Trees. ECML, 1998.
[6] Vladimir N. Vapnik, et al. The Nature of Statistical Learning Theory. Statistics for Engineering and Information Science, 2000.
[7] Thorsten Brants, et al. TnT – A Statistical Part-of-Speech Tagger. ANLP, 2000.
[8] Christopher D. Manning, et al. Enriching the Knowledge Sources Used in a Maximum Entropy Part-of-Speech Tagger. EMNLP, 2000.