This paper presents the system developed by Elhuyar for the TweetNorm evaluation campaign, which consists of normalizing Spanish tweets to standard language. The normalization covers only the correction of certain out-of-vocabulary (OOV) words previously identified by the organizers. The system follows a two-step strategy. First, candidates for each OOV word are generated by several methods, each addressing a different source of error: expansion of common abbreviations, correction of colloquial forms, correction of repeated characters, normalization of interjections, and correction of spelling errors by means of edit-distance metrics. Next, the correct candidates are selected using a language model trained on corpora of correct Spanish text. The system obtained 68.3% accuracy on the development set and 63.36% on the test set, ranking fourth in the evaluation campaign.
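The two-step strategy can be illustrated with a minimal sketch. It is not the system's implementation: the vocabulary, abbreviation table, and word counts below are invented toy data, the function names are hypothetical, and a smoothed unigram model stands in for the language model, whose actual order and training setup are not given here. Colloquial forms and interjections are omitted for brevity.

```python
import math
import re

# Toy resources, all hypothetical: a real system would load a full lexicon,
# an abbreviation list, and counts from a large corpus of correct Spanish.
ABBREVIATIONS = {"q": "que", "xfa": "por favor", "tb": "también"}
COUNTS = {"que": 50, "hola": 20, "casa": 10, "cosa": 30,
          "por": 15, "favor": 8, "también": 12, "bueno": 9}
VOCAB = set(COUNTS)
LETTERS = "abcdefghijklmnñopqrstuvwxyzáéíóúü"


def collapse_repeats(word):
    """Variants with each run of a repeated character cut to two and to one."""
    return {re.sub(r"(.)\1+", r"\1\1", word), re.sub(r"(.)\1+", r"\1", word)}


def edits1(word):
    """All strings within one delete, transpose, replace, or insert of word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    replaces = [l + c + r[1:] for l, r in splits if r for c in LETTERS]
    inserts = [l + c + r for l, r in splits for c in LETTERS]
    return set(deletes + transposes + replaces + inserts)


def generate_candidates(oov):
    """Step 1: pool the candidates proposed by each generation method."""
    word = oov.lower()
    candidates = set()
    if word in ABBREVIATIONS:
        candidates.add(ABBREVIATIONS[word])
    for variant in collapse_repeats(word) | {word}:
        if variant in VOCAB:
            candidates.add(variant)
        candidates |= edits1(variant) & VOCAB
    return candidates or {oov}  # fall back to the original form


def score(candidate):
    """Step 2 stand-in: mean add-one-smoothed unigram log-probability."""
    total = sum(COUNTS.values()) + len(COUNTS)
    tokens = candidate.split()
    return sum(math.log((COUNTS.get(t, 0) + 1) / total) for t in tokens) / len(tokens)


def normalize(oov):
    """Return the highest-scoring candidate; ties are broken alphabetically."""
    return max(sorted(generate_candidates(oov)), key=score)


print(normalize("holaaaa"))  # repeated characters
print(normalize("xfa"))      # abbreviation
print(normalize("csa"))      # spelling error with two in-vocabulary candidates
```

In this sketch the generators are pooled without weighting, so the scoring step alone decides between competing candidates such as "casa" and "cosa" for "csa". A context-aware n-gram model would instead score each candidate together with its neighboring words.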