Developing New Linguistic Resources and Tools for the Galician Language

This paper describes our work on developing new resources and Natural Language Processing (NLP) tools for the Galician language. First, we describe a new, manually revised corpus for part-of-speech (POS) tagging and lemmatization. Second, we present a new manually annotated corpus for Named Entity tagging. Third, we train and develop new NLP tools for Galician, including the first publicly available Galician statistical modules for lemmatization and Named Entity Recognition, as well as new modules for POS tagging, Wikification and Named Entity Disambiguation. Finally, we present two new Web demo applications for testing the new set of tools online.
