A Normalizer for UGC in Brazilian Portuguese

User-generated content (UGC) is an important source of information for governments, companies, political candidates and consumers. However, most Natural Language Processing (NLP) tools and techniques are developed from and for standard-language text, whereas UGC is especially rich in creativity and idiosyncrasies, which act as noise for NLP purposes. This paper presents UGCNormal, a lexicon-based tool for UGC normalization. It comprises a tokenizer, a sentence segmentation tool, a phonetic-based speller and several lexicons, all derived from a deep analysis of a corpus of product reviews in Brazilian Portuguese. Evaluated on two different data sets, the normalizer made between 31% and 89% of the appropriate corrections, depending on the type of text noise. UGCNormal was also validated on two downstream tasks: POS tagging, where accuracy improved from 91.35% to 93.15%, and opinion classification, where the average of the positive and negative F1-scores improved from 0.736 to 0.758.
