A Large Portuguese Corpus On-Line: Cleaning and Preprocessing

We present a newly available on-line resource for Portuguese: a 310-million-word corpus that constitutes a new version of the Reference Corpus of Contemporary Portuguese and is now searchable through a user-friendly web interface. We report on the work carried out on the corpus prior to its on-line publication, focusing on the processes and tools used to clean, prepare, and annotate it so that it is suitable for linguistic inquiry.
