Paper Information - A Large Portuguese Corpus On-Line: Cleaning and Preprocessing

We present a newly available on-line resource for Portuguese: a corpus of 310 million words, a new version of the Reference Corpus of Contemporary Portuguese, now searchable via a user-friendly web interface. Here we report on work carried out on the corpus prior to its publication on-line. We focus on the processes and tools involved in cleaning, preparing and annotating the corpus to make it suitable for linguistic inquiry.
