Holaaa!! writin like u talk is kewl but kinda hard 4 NLP

We present work in progress on tools for normalizing User-Generated Content (UGC). As we will see, the task requires revisiting the initial steps of the NLP pipeline, since UGC (micro-blogs, blogs and, more generally, Web 2.0 user texts) presents a number of non-standard communicative and linguistic characteristics and is in fact much closer to oral and colloquial language than to edited text. We present and characterize a corpus of Spanish UGC drawn from three sources: Twitter, consumer reviews and blogs. We motivate the need for UGC normalization by analyzing the problems that arise when this type of text is processed through a conventional language processing pipeline, particularly in lemmatization and morphosyntactic tagging. Finally, we propose a strategy for automatically normalizing UGC using a selector of correct forms on top of a pre-existing spell-checker.
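
The idea of a selector of correct forms on top of a pre-existing spell-checker can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the candidate table stands in for a real spell-checker, the unigram counts stand in for a language model trained on edited Spanish text, and the names `suggest`, `select` and `normalize` are hypothetical. Picking the most frequent candidate is only one possible selection criterion.

```python
from collections import Counter

# Hypothetical toy data. A real system would query an actual spell-checker
# and a language model estimated from edited Spanish text.
CANDIDATES = {
    "q": ["que", "qué"],
    "tb": ["también", "tubo"],
    "holaaa": ["hola", "ola"],
}

UNIGRAM_COUNTS = Counter(
    {"que": 50, "qué": 10, "también": 20, "tubo": 1, "hola": 15, "ola": 3}
)


def suggest(token):
    """Stand-in for a pre-existing spell-checker: return candidate forms."""
    return CANDIDATES.get(token.lower(), [])


def select(token):
    """Choose the most frequent candidate; keep the token if none exist."""
    candidates = suggest(token)
    if not candidates:
        return token
    return max(candidates, key=lambda c: UNIGRAM_COUNTS[c])


def normalize(text):
    """Normalize whitespace-separated tokens one at a time."""
    return " ".join(select(tok) for tok in text.split())


print(normalize("holaaa q tal"))
```

A context-aware selector (for example, one scoring candidates with an n-gram model over neighbouring words) would slot into `select` without changing the rest of the sketch.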
