A Character-Level Machine Translation Approach for Normalization of SMS Abbreviations

This paper describes a two-phase method for expanding abbreviations found in informal text (e.g., email, text messages, chat room conversations). In the first phase, a machine translation system trained at the character level learns mappings between character-level “phrases”, which makes it much more robust to new abbreviations than a word-level system. We generate translation models that are independent of the way in which the abbreviations are formed and show that they suffer little degradation compared to type-dependent models. Our experiments on a large data set show that the proposed system performs well both on isolated abbreviations and, with the incorporation of a second phase utilizing an in-domain language model, on abbreviations in the context of neighboring words.
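The two phases can be illustrated with a minimal sketch; it is not the authors' implementation. For the first phase, it assumes that character-level training data is produced by writing each abbreviation and its expansion as space-separated characters, so that a phrase-based translation system treats characters as "words". For the second phase, it assumes that candidate expansions arrive with log scores from the first phase and are re-scored with an add-one smoothed bigram model, used here only as a stand-in for an in-domain language model. All function names, the `_` space marker, the example pairs, and the scores are hypothetical.

```python
import math
from collections import Counter


def to_char_tokens(text, space_marker="_"):
    """Write text as space-separated characters; real spaces become a marker."""
    return " ".join(space_marker if ch == " " else ch for ch in text)


def from_char_tokens(tokens, space_marker="_"):
    """Invert to_char_tokens: join characters and restore real spaces."""
    return "".join(" " if tok == space_marker else tok for tok in tokens.split(" "))


def train_bigram_counts(sentences):
    """Collect unigram and bigram counts from whitespace-tokenized sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        words = ["<s>"] + sentence.split() + ["</s>"]
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    return unigrams, bigrams


def bigram_logprob(prev, word, unigrams, bigrams):
    """Add-one smoothed log P(word | prev)."""
    vocab_size = len(unigrams)
    return math.log((bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size))


def pick_in_context(prev, candidates, nxt, unigrams, bigrams):
    """Choose the candidate with the best phase-one score plus context score.

    candidates is a list of (word, log_score) pairs; the log scores are
    assumed to come from the character-level translation phase.
    """
    def total(candidate):
        word, score = candidate
        return (score
                + bigram_logprob(prev, word, unigrams, bigrams)
                + bigram_logprob(word, nxt, unigrams, bigrams))
    return max(candidates, key=total)[0]


# Phase one: hypothetical (abbreviation, expansion) pairs as parallel text.
pairs = [("2nite", "tonight"), ("b4", "before"), ("c u", "see you")]
source_lines = [to_char_tokens(abbr) for abbr, _ in pairs]
target_lines = [to_char_tokens(word) for _, word in pairs]

# Phase two: toy in-domain text and made-up candidate scores for "b4".
unigrams, bigrams = train_bigram_counts(
    ["see you before lunch", "i will be there before noon"]
)
choice = pick_in_context("you", [("before", -1.0), ("bore", -1.0)], "lunch",
                         unigrams, bigrams)
```

Because the source and target sides are sequences of characters, the translation model can recover an unseen abbreviation from character patterns it has already observed, while the context score separates candidates that the first phase rates equally.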
