Insertion, Deletion, or Substitution? Normalizing Text Messages without Pre-categorization nor Supervision

Most text message normalization approaches are based on supervised learning and rely on human-labeled training data. In addition, nonstandard words are often categorized into different types, and specific models are designed to tackle each type. In this paper, we propose a unified letter transformation approach that requires neither pre-categorization nor human supervision. Our approach models the generation process from dictionary words to nonstandard tokens under a sequence labeling framework, where each letter in the dictionary word can be retained, removed, or substituted by other letters or digits. To avoid the expensive and time-consuming hand-labeling process, we automatically collected a large set of noisy training pairs using a novel web-based approach and performed character-level alignment for model training. Experiments on both Twitter and SMS messages show that our system significantly outperformed the state-of-the-art deletion-based abbreviation system and the Jazzy spell checker (absolute accuracy gains of 21.69% and 18.16% over the Jazzy spell checker on the two test sets, respectively).
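The per-letter labeling described above can be illustrated with a small sketch. It is not the alignment procedure used in the paper, which the abstract does not specify. It assumes a plain unit-cost edit distance with no insertions, so every letter of the dictionary word receives exactly one label: itself (retained), `-` (removed), or another character (substituted). The function name `label_letters` and the `-` deletion symbol are hypothetical choices made for this illustration.

```python
def label_letters(word, token):
    """Label each letter of `word` as retained, removed ('-'), or substituted.

    Assumption: unit costs for deletion and substitution, no insertions,
    so `token` cannot be longer than `word`.
    """
    n, m = len(word), len(token)
    inf = float("inf")
    # cost[i][j]: cheapest way to turn word[:i] into token[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0
    for i in range(1, n + 1):
        for j in range(min(i, m) + 1):
            best = cost[i - 1][j] + 1  # remove word[i-1]
            if j > 0:
                step = 0 if word[i - 1] == token[j - 1] else 1
                best = min(best, cost[i - 1][j - 1] + step)  # retain or substitute
            cost[i][j] = best
    if cost[n][m] == inf:
        raise ValueError("token is longer than word; insertions are not modeled")

    # Trace back from the end, preferring retain/substitute over removal on ties.
    labels = []
    i, j = n, m
    while i > 0:
        if j > 0:
            step = 0 if word[i - 1] == token[j - 1] else 1
            if cost[i][j] == cost[i - 1][j - 1] + step:
                labels.append(token[j - 1])
                i, j = i - 1, j - 1
                continue
        labels.append("-")
        i -= 1
    labels.reverse()
    return list(zip(word, labels))


print(label_letters("love", "luv"))
# [('l', 'l'), ('o', 'u'), ('v', 'v'), ('e', '-')]
```

Pairs such as `("tomorrow", "2moro")` admit several equally cheap alignments, so the labels this sketch returns for them depend on its tie-breaking rule. In every case, joining the non-`-` labels reproduces the nonstandard token, which is the property a sequence labeler trained on such pairs would rely on.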
