Elhuyar at Tweet-Norm 2013

This paper presents the system developed by Elhuyar for the Tweet-Norm evaluation campaign, which consists of normalizing Spanish tweets to standard language. The normalization covers only the correction of certain out-of-vocabulary (OOV) words previously identified by the organizers. The system follows a two-step strategy. First, candidates for each OOV word are generated by several methods, each targeting a different error source: expansion of common abbreviations, correction of colloquial forms, correction of repeated characters, normalization of interjections, and correction of spelling errors by means of edit-distance metrics. Next, the correct candidates are selected using a language model trained on corpora of correct Spanish text. The system obtained 68.3% accuracy on the development set and 63.36% on the test set, ranking 4th in the evaluation campaign.
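
The two-step strategy described above (candidate generation followed by language-model selection) can be illustrated with a minimal Python sketch. This is not the Elhuyar implementation. The abbreviation table, lexicon, toy corpus, the add-one smoothed bigram model, and every function name below are assumptions made for illustration only, and only three of the five candidate sources are shown.

```python
import re
from collections import Counter

# Hypothetical toy resources; the system's real dictionaries and corpora are not shown here.
ABBREVIATIONS = {"q": "que", "xq": "porque", "tb": "también"}
LEXICON = {"que", "porque", "también", "hola", "casa", "cosa",
           "bueno", "muy", "mi", "es", "bonita"}
CORPUS = "mi casa es muy bonita . hola que tal . mi casa es bonita".split()

UNIGRAMS = Counter(CORPUS)
BIGRAMS = Counter(zip(CORPUS, CORPUS[1:]))


def edit_distance(a, b):
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def generate_candidates(oov, max_dist=1):
    """Step 1: collect candidates from several illustrative error sources."""
    word = oov.lower()
    candidates = set()
    # Abbreviation expansion via dictionary lookup.
    if word in ABBREVIATIONS:
        candidates.add(ABBREVIATIONS[word])
    # Repeated characters: collapse each run to one or two characters
    # (two keeps legitimate doubles such as "ll" or "rr").
    for repl in (r"\1", r"\1\1"):
        collapsed = re.sub(r"(.)\1+", repl, word)
        if collapsed in LEXICON:
            candidates.add(collapsed)
    # Spelling errors: lexicon words within a small edit distance.
    candidates.update(w for w in LEXICON if edit_distance(word, w) <= max_dist)
    # Fall back to the original form when nothing was generated.
    return candidates or {oov}


def score(candidate, prev_word):
    """Add-one smoothed bigram probability P(candidate | prev_word)."""
    vocab_size = len(UNIGRAMS)
    return (BIGRAMS[(prev_word, candidate)] + 1) / (UNIGRAMS[prev_word] + vocab_size)


def normalize(oov, prev_word):
    """Step 2: pick the candidate the language model scores highest."""
    # Sorting first makes tie-breaking deterministic.
    return max(sorted(generate_candidates(oov)), key=lambda c: score(c, prev_word))


print(normalize("holaaa", "."))   # repeated characters
print(normalize("xq", "hola"))    # abbreviation
print(normalize("cxsa", "mi"))    # two edit-distance candidates, resolved by context
```

In the last call, both "casa" and "cosa" lie one edit away from the input, so the choice is left to the language model, which prefers the candidate seen after "mi" in the toy corpus. A realistic system would presumably use a higher-order model over both left and right context, which this sketch does not attempt.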