Evaluating word embeddings and a revised corpus for part-of-speech tagging in Portuguese

Background: Part-of-speech tagging is an important preprocessing step in many natural language processing applications. Despite much work already carried out in this field, there is still room for improvement, especially in Portuguese. We experiment here with an architecture based on neural networks and word embeddings that has achieved promising results in English.

Methods: We tested our classifier on different corpora: a new revision of the Mac-Morpho corpus, in which we merged some tags and performed corrections, and two previous versions of it. We evaluate the impact of using different types of word embeddings and explicit features as input.

Results: We compare our tagger’s performance with other systems and achieve state-of-the-art results on the new corpus. We show how different methods for generating word embeddings and additional features differ in accuracy.

Conclusions: The work reported here contributes a new revision of the Mac-Morpho corpus and a new state-of-the-art tagger available for use out of the box.
