Choosing a Spanish Part-of-Speech tagger for a lexically sensitive task

This article compares four Part-of-Speech (PoS) taggers for Spanish. The taggers were evaluated without prior training or tuning. To make their output comparable, their tagsets were mapped to the universal PoS tagset (Petrov, Das and McDonald, 2012). The taggers are also compared with regard to the information they provide and how they treat special features of Spanish such as verbal clitics and portmanteaux.
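The tagset mapping step can be illustrated with a minimal sketch. It assumes an EAGLES-style tagset in which the first character of a tag encodes the coarse category (the convention used by FreeLing, which appears in the reference list). The prefix table, the function names and the example tags are hypothetical illustrations, not the mapping used in the article; only the twelve universal tags are taken from Petrov, Das and McDonald.

```python
# The 12 coarse categories of the universal PoS tagset.
UNIVERSAL_TAGS = {
    "NOUN", "VERB", "ADJ", "ADV", "PRON", "DET",
    "ADP", "NUM", "CONJ", "PRT", ".", "X",
}

# Hypothetical prefix table for an EAGLES-style tagset (an assumption made
# for illustration, not the article's actual mapping).
EAGLES_PREFIX_TO_UNIVERSAL = {
    "N": "NOUN",  # nouns
    "V": "VERB",  # verbs
    "A": "ADJ",   # adjectives
    "R": "ADV",   # adverbs
    "P": "PRON",  # pronouns
    "D": "DET",   # determiners
    "S": "ADP",   # adpositions
    "Z": "NUM",   # numerals
    "C": "CONJ",  # conjunctions
    "F": ".",     # punctuation
}


def to_universal(tag: str) -> str:
    """Map a tagger-specific tag to a universal tag, falling back to X."""
    if not tag:
        return "X"
    return EAGLES_PREFIX_TO_UNIVERSAL.get(tag[0].upper(), "X")


def map_tagged_sentence(tagged):
    """Convert a list of (token, tag) pairs to (token, universal tag) pairs."""
    return [(token, to_universal(tag)) for token, tag in tagged]


# Example with made-up EAGLES-like tags for "La casa es grande."
example = [
    ("La", "DA0FS0"),
    ("casa", "NCFS000"),
    ("es", "VSIP3S0"),
    ("grande", "AQ0CS0"),
    (".", "Fp"),
]
print(map_tagged_sentence(example))
```

A prefix lookup like this only covers tagsets that encode the category in the first character. Taggers with other tagsets would need their own tables, and differences in how clitics and portmanteaux are tokenized would still have to be reconciled before token-level comparison.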

[1] Egoitz Laparra, et al. Multilingual Central Repository version 3.0, 2012, LREC.

[2] Carla Parra Escartín. Design and compilation of a specialized Spanish-German parallel corpus, 2012, LREC.

[3] Xavier Carreras, et al. FreeLing: An Open-Source Suite of Language Analyzers, 2004, LREC.

[4] Michael Collins, et al. Discriminative Training Methods for Hidden Markov Models: Theory and Experiments with Perceptron Algorithms, 2002, EMNLP.

[5] Antske Fokkens, et al. NAF and GAF: Linking Linguistic Annotations, 2014.

[6] German Rigau, et al. IXA pipeline: Efficient and Ready to Use Multilingual NLP tools, 2014, LREC.

[7] José Miguel Goñi-Menoyo, et al. GRAMPAL: A Morphological Processor for Spanish implemented in Prolog, 1995, GULP-PRODE.

[8] Stephan Oepen, et al. Tokenization: Returning to a Long Solved Problem — A Survey, Contrastive Experiment, Recommendations, and Toolkit —, 2012, ACL.

[9] Attila Novák, et al. Hybrid Text Segmentation for Hungarian Clinical Records, 2013, MICAI.

[10] Murhaf Fares, et al. Machine Learning for High-Quality Tokenization Replicating Variable Tokenization Schemes, 2013, CICLing.

[11] Adwait Ratnaparkhi, et al. Learning to Parse Natural Language with Maximum Entropy Models, 1999, Machine Learning.

[12] Horacio Rodríguez, et al. A Machine Learning Approach to POS Tagging, 2000, Machine Learning.

[13] Huaiyu Zhu. On Information and Sufficiency, 1997.

[14] Helmut Schmid, et al. Probabilistic part-of-speech tagging using decision trees, 1994.

[15] Lluís Padró, et al. FreeLing 3.0: Towards Wider Multilinguality, 2012, LREC.

[16] Slav Petrov, et al. A Universal Part-of-Speech Tagset, 2011, LREC.

[17] Thorsten Brants, et al. TnT – A Statistical Part-of-Speech Tagger, 2000, ANLP.

[18] Mariona Taulé, et al. AnCora: Multilevel Annotated Corpora for Catalan and Spanish, 2008, LREC.