Unsupervised Mining of Lexical Variants from Noisy Text

The amount of data produced in user-generated content continues to grow at a staggering rate. However, the text found in these media can deviate wildly from the standard rules of orthography, syntax, and even semantics, which poses significant problems for downstream applications that make use of this noisy data. In this paper we present a novel unsupervised method for extracting domain-specific lexical variants from a large volume of text. We demonstrate the utility of this method by applying it to normalize messages from the online social media service Twitter into their most likely standard English versions. Our method yields a 20% reduction in word error rate over an existing state-of-the-art approach.
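
For readers unfamiliar with the evaluation metric, word error rate is conventionally the word-level edit distance between a system output and a reference, divided by the reference length. The following is a minimal sketch of that standard definition, not the paper's evaluation script. The function names and the toy sentences are illustrative assumptions, and `relative_reduction` assumes the reported reduction is measured relative to the baseline, which the abstract does not state.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline: float, improved: float) -> float:
    """Fraction by which `improved` lowers `baseline` (assumed relative)."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - improved) / baseline


if __name__ == "__main__":
    # Toy example (hypothetical strings): an unnormalized message
    # scored against its standard English reference.
    reference = "see you tomorrow"
    unnormalized = "c u tomorrow"
    print(word_error_rate(reference, unnormalized))
    print(relative_reduction(0.25, 0.20))
```

Under this definition, a normalizer that rewrites more lexical variants to their standard forms lowers the edit distance to the reference and therefore the word error rate.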
