Paper Information - Normalizing Microtext

Normalizing Microtext

Computer-mediated communication has given rise to a new form of written text, microtext, which differs markedly from well-written text. Tweets and SMS messages, which are limited in length and may contain misspellings, slang, or abbreviations, are two typical examples of microtext. Microtext poses new challenges for standard natural language processing tools, which are usually designed for well-written text. The objective of this work is to normalize microtext so that the resulting text is suitable for further processing. We propose a normalization approach based on the source channel model that incorporates four factors: an orthographic factor, a phonetic factor, a contextual factor, and acronym expansion. Experiments show that our approach normalizes Twitter messages reasonably well and outperforms existing algorithms on a public SMS data set.
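To make the source channel idea in the abstract concrete, here is a minimal sketch in Python. It is not the authors' implementation. It picks the candidate word that maximizes the product of a language-model prior and a channel likelihood. The toy vocabulary `TOY_LM`, the function names, and the `exp(-edit distance)` channel are all assumptions introduced for illustration. Only an orthographic-style factor is mimicked; the phonetic factor, contextual factor, and acronym expansion are omitted.

```python
import math


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


# Hypothetical toy unigram prior P(word); the values are made up.
TOY_LM = {"tomorrow": 0.4, "tomato": 0.1, "to": 0.5}


def channel_score(noisy: str, candidate: str) -> float:
    """Assumed stand-in for P(noisy | candidate): decays with edit distance."""
    return math.exp(-edit_distance(noisy, candidate))


def normalize_token(noisy: str, lm: dict = TOY_LM) -> str:
    """Return the candidate maximizing P(candidate) * P(noisy | candidate)."""
    return max(lm, key=lambda w: lm[w] * channel_score(noisy, w))


print(normalize_token("tomorow"))
```

With this toy vocabulary, the misspelling "tomorow" is one edit away from "tomorrow", so that candidate wins despite the higher prior of "to". A heavily abbreviated form such as "tmrw" would not be recovered by an edit-distance channel alone, which suggests why additional factors are useful.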

Brian D. Davison | Dawei Yin | Zhenzhen Xue
