Some Issues on the Normalization of a Corpus of Products Reviews in Portuguese

This paper analyzes different kinds of noise in a corpus of product reviews in Brazilian Portuguese. The major kinds of noise we face are case folding, punctuation, spelling, and the use of internet slang. After observing the effect of this noise on the POS tagging task, we propose some procedures to minimize it.
