Tagsets and Datasets: Some Experiments Based on Portuguese Language

We report the results of two experiments that investigate the impact of linguistic variation on PoS tagging. Both start from a conversion of the MacMorpho corpus [1], which was re-annotated according to the Universal Dependencies PoS tagset. During the conversion, we faced linguistic challenges related to past participle forms. As a result, we created two converted corpora (MacMorpho-UD and MacMorpho-UD+PCP). We used all three corpora (the original MacMorpho, MacMorpho-UD, and MacMorpho-UD+PCP) to assess the impact on PoS learning in different scenarios.

[1] Pablo Gamallo et al. PoS-tagging the Web in Portuguese. National varieties, text typologies and spelling systems, 2014, Proces. del Leng. Natural.

[2] Diana Santos et al. POS tagging: clarificação histórico-terminológica, 2009.

[3] Elliott Macklovitch. Where the Tagger Falters, 2005.

[4] Pablo Gamallo et al. A rule-based system for cross-lingual parsing of Romance languages with Universal Dependencies, 2017, CoNLL Shared Task.

[5] Christopher D. Manning. Computational Linguistics and Deep Learning, 2015, Computational Linguistics.

[6] Yan Huang et al. Anchoring and Agreement in Syntactic Annotations, 2016, EMNLP.

[7] Sandra M. Aluísio et al. Evaluating word embeddings and a revised corpus for part-of-speech tagging in Portuguese, 2014, Journal of the Brazilian Computer Society.

[8] Sylvain Auroux et al. La révolution technologique de la grammatisation. Introduction à l'histoire des sciences du langage, 1994.

[9] Sandra M. Aluísio et al. An Account of the Challenge of Tagging a Reference Corpus for Brazilian Portuguese, 2003, PROPOR.

[10] Slav Petrov et al. A Universal Part-of-Speech Tagset, 2011, LREC.

[11] Eckhard Bick et al. Universal Dependencies for Portuguese, 2017, DepLing.

[12] João Luís Garcia Rosa et al. Mac-Morpho Revisited: Towards Robust Part-of-Speech Tagging, 2013, STIL.

[13] Celso Ferreira da Cunha et al. Nova gramática do português contemporâneo, 1985.

[14] Adam Kilgarriff et al. Corpus tools for lexicographers, 2011.

[15] Sampo Pyysalo et al. Universal Dependencies v1: A Multilingual Treebank Collection, 2016, LREC.