Paper information - A Character-Level Machine Translation Approach for Normalization of SMS Abbreviations

This paper describes a two-phase method for expanding abbreviations found in informal text (e.g., email, text messages, chat room conversations). The first phase uses a machine translation system trained at the character level, so the system learns mappings between character-level “phrases” and is much more robust to new abbreviations than a word-level system. We generate translation models that are independent of the way in which the abbreviations are formed and show that they suffer little degradation compared with type-dependent models. Our experiments on a large data set show that the proposed system performs well both on isolated abbreviations and, with the addition of a second phase that uses an in-domain language model, on abbreviations in the context of neighboring words.
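
To make the idea of character-level “phrases” concrete, the sketch below shows one plausible way to prepare training pairs so that a phrase-based translation toolkit treats each character as a token. The abstract does not state the authors' actual preprocessing, so the helper names (`to_char_tokens`, `from_char_tokens`), the `_` marker for real spaces, and the example pairs are all assumptions made for illustration.

```python
def to_char_tokens(text: str, space_marker: str = "_") -> str:
    """Split a string into space-separated characters.

    Real spaces are replaced by `space_marker` (an assumed convention) so
    that word boundaries survive the character-level tokenization.
    """
    return " ".join(space_marker if ch == " " else ch for ch in text)


def from_char_tokens(tokens: str, space_marker: str = "_") -> str:
    """Invert to_char_tokens: join the characters and restore real spaces."""
    return "".join(" " if tok == space_marker else tok for tok in tokens.split())


# Hypothetical (abbreviation, standard form) pairs, for illustration only.
pairs = [("2moro", "tomorrow"), ("b4", "before"), ("ttyl", "talk to you later")]

# Source and target sides of a character-level parallel corpus.
source_side = [to_char_tokens(abbr) for abbr, _ in pairs]
target_side = [to_char_tokens(full) for _, full in pairs]

for src, tgt in zip(source_side, target_side):
    print(f"{src}\t{tgt}")
```

With data in this shape, a "phrase" learned by the translation model is a short run of characters rather than a run of words, which is what would allow an unseen abbreviation to be expanded from familiar character patterns. The round trip only works if the input does not itself contain the marker character.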

Yang Liu | Deana Pennell
