Normalizing Microtext

Computer-mediated communication has produced a new form of written text, microtext, that differs markedly from well-written text. Tweets and SMS messages, which are limited in length and may contain misspellings, slang, or abbreviations, are two typical examples. Microtext poses new challenges for standard natural language processing tools, which are usually designed for well-written text. The objective of this work is to normalize microtext so that it is suitable for further processing. We propose a normalization approach based on the source channel model that incorporates four factors: an orthographic factor, a phonetic factor, a contextual factor, and acronym expansion. Experiments show that our approach normalizes Twitter messages reasonably well and outperforms existing algorithms on a public SMS data set.
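To make the source channel idea concrete, the sketch below picks, for each noisy token s, the candidate t that maximizes P(t | context) × P(s | t), where the channel term mixes orthographic, phonetic, and acronym evidence. It is a toy illustration and not the system evaluated in the paper: the probability tables, the mixture weights, the vowel-dropping phonetic key, the greedy left-to-right decoding, and all function names are assumptions made for this example.

```python
import math
from difflib import SequenceMatcher

# Toy resources, invented for illustration only (not from the paper).
UNIGRAM = {"see": 0.2, "you": 0.2, "your": 0.2, "to": 0.2,
           "tomorrow": 0.1, "by the way": 0.1}
BIGRAM = {("see", "you"): 0.6, ("you", "tomorrow"): 0.3}
ACRONYMS = {"btw": "by the way"}
DIGIT_SOUNDS = {"2": "to", "4": "for", "8": "ate"}
# Hypothetical mixture weights for the three channel factors.
WEIGHTS = {"orth": 0.3, "phon": 0.3, "acr": 0.4}
FLOOR = 1e-6  # keeps log() defined when no factor fires


def orthographic(s, t):
    """Surface similarity in [0, 1] based on matching character blocks."""
    return SequenceMatcher(None, s, t).ratio()


def phonetic_key(word):
    """Crude sound key: spell out digits, drop vowels/w/h/y, merge repeats."""
    spelled = "".join(DIGIT_SOUNDS.get(ch, ch) for ch in word)
    key = []
    for ch in spelled:
        if ch in "aeiouwhy":
            continue
        if not key or key[-1] != ch:
            key.append(ch)
    return "".join(key)


def phonetic(s, t):
    """1.0 when both words share the same crude sound key, else 0.0."""
    return 1.0 if phonetic_key(s) == phonetic_key(t) else 0.0


def acronym(s, t):
    """1.0 when t is the listed expansion of s, else 0.0."""
    return 1.0 if ACRONYMS.get(s) == t else 0.0


def channel(s, t):
    """Stand-in for P(s | t): weighted mix of the three channel factors."""
    score = (WEIGHTS["orth"] * orthographic(s, t)
             + WEIGHTS["phon"] * phonetic(s, t)
             + WEIGHTS["acr"] * acronym(s, t))
    return max(score, FLOOR)


def context(prev, t):
    """Stand-in for P(t | prev): bigram with a crude unigram backoff."""
    return BIGRAM.get((prev, t), 0.1 * UNIGRAM.get(t, FLOOR))


def normalize(tokens):
    """Greedy left-to-right decoding of the noisy channel objective."""
    output, prev = [], None
    for raw in tokens:
        s = raw.lower()
        candidates = set(UNIGRAM)
        if s in ACRONYMS:
            candidates.add(ACRONYMS[s])
        best = max(sorted(candidates),
                   key=lambda t: math.log(context(prev, t))
                   + math.log(channel(s, t)))
        output.append(best)
        prev = best
    return output


print(normalize("btw see u 2moro".split()))
# ['by the way', 'see', 'you', 'tomorrow']
```

In this toy setup the phonetic key resolves "2moro" to "tomorrow", the bigram table favors "you" over "your" after "see", and the acronym table expands "btw". A realistic system would replace each stand-in with a trained model and search over whole sentences rather than decoding greedily.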
