The AnIta-Lemmatiser: A Tool for Accurate Lemmatisation of Italian Texts

This paper presents the AnIta-Lemmatiser, an automatic tool for lemmatising Italian texts. It is based on a powerful morphological analyser enriched with a large lexicon, together with heuristic techniques that select the most appropriate lemma among those that can be morphologically associated with an ambiguous wordform. The heuristics rely essentially on the frequency-of-use tags provided by the De Mauro/Paravia electronic dictionary. The AnIta-Lemmatiser ranked second in the Lemmatisation Task of the EVALITA 2011 evaluation campaign. Some further improvements beyond the official lemmatiser used for EVALITA are also presented.
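The lemma-selection heuristic can be illustrated with a minimal sketch. It is not the AnIta-Lemmatiser's actual implementation: the `Candidate` type, the `select_lemma` function, and the ordering in `TAG_RANK` are all hypothetical. The tag abbreviations are meant to resemble the usage marks found in De Mauro's dictionary, and the tags assigned to the example lemmas are invented for illustration. The sketch assumes that a morphological analyser has already returned the candidate lemmas for an ambiguous wordform.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

# Assumed ordering of frequency-of-use tags, most common first.
# Both the tag set and its ranking are illustrative, not taken from the paper.
TAG_RANK = {"FO": 0, "AU": 1, "AD": 2, "CO": 3, "TS": 4, "LE": 5, "BU": 6, "OB": 7}
UNKNOWN_RANK = len(TAG_RANK)  # lemmas without a known tag are ranked last


@dataclass(frozen=True)
class Candidate:
    """A lemma proposed by a (hypothetical) morphological analyser."""
    lemma: str
    usage_tag: Optional[str] = None


def select_lemma(candidates: Sequence[Candidate]) -> str:
    """Return the candidate lemma whose usage tag marks the most frequent use.

    Ties are resolved in favour of the candidate that appears first.
    """
    if not candidates:
        raise ValueError("at least one candidate lemma is required")
    best = min(candidates, key=lambda c: TAG_RANK.get(c.usage_tag, UNKNOWN_RANK))
    return best.lemma


# The plural noun "osservatori" can derive from "osservatore" or "osservatorio".
# The tags below are invented for this example.
candidates = [
    Candidate("osservatorio", "CO"),
    Candidate("osservatore", "AU"),
]
chosen = select_lemma(candidates)
print(chosen)
```

Ranking by a fixed tag order keeps the sketch deterministic and needs no training data; a real system would also have to decide what to do when candidates share the same tag, which this sketch resolves simply by input order.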
