Normalization of Text Messages Using Character- and Phone-based Machine Translation Approaches

SMS and Twitter messages contain many abbreviations and non-standard words, which pose problems for text-to-speech (TTS) and other language processing techniques applied to these data. A character-based machine translation (MT) approach was previously used to normalize non-standard words. In this paper, we propose a two-stage translation method that leverages phonetic information: non-standard words are first translated to possible pronunciations, which are then translated to standard words. We further combine this method with the single-step character-based translation module. Our experiments show that the proposed method significantly outperforms previous results in both n-best coverage and 1-best accuracy.
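The two-stage idea can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the toy lookup tables stand in for trained translation models, the probabilities are invented, and the linear interpolation used to combine the two modules is only one plausible choice, since the abstract does not say how the combination is done.

```python
from collections import defaultdict

# Hypothetical toy tables standing in for trained translation models.
# All entries and probabilities are invented for illustration.
CHAR_TO_PHONE = {  # stage 1: non-standard word -> candidate pronunciations
    "2nite": [("T AH N AY T", 0.9), ("T UW N AY T", 0.1)],
}
PHONE_TO_WORD = {  # stage 2: pronunciation -> candidate standard words
    "T AH N AY T": [("tonight", 1.0)],
    "T UW N AY T": [("tonight", 0.6), ("too night", 0.4)],
}
CHAR_TO_WORD = {  # single-step character-based module
    "2nite": [("tonight", 0.7), ("tonite", 0.3)],
}


def two_stage(token):
    """Score standard words by summing over intermediate pronunciations."""
    scores = defaultdict(float)
    for phones, p_phone in CHAR_TO_PHONE.get(token, []):
        for word, p_word in PHONE_TO_WORD.get(phones, []):
            scores[word] += p_phone * p_word
    return dict(scores)


def one_stage(token):
    """Score standard words with the direct character-based module."""
    return dict(CHAR_TO_WORD.get(token, []))


def normalize(token, weight=0.5, n_best=3):
    """Combine both modules (assumed: linear interpolation) into an n-best list."""
    phone_scores = two_stage(token)
    char_scores = one_stage(token)
    combined = {
        word: weight * phone_scores.get(word, 0.0)
        + (1.0 - weight) * char_scores.get(word, 0.0)
        for word in set(phone_scores) | set(char_scores)
    }
    # Sort by descending score, breaking ties alphabetically.
    return sorted(combined.items(), key=lambda kv: (-kv[1], kv[0]))[:n_best]


if __name__ == "__main__":
    for word, score in normalize("2nite"):
        print(f"{word}\t{score:.2f}")
```

In this sketch the first entry of the returned list plays the role of the 1-best hypothesis, and the full list corresponds to the n-best candidates whose coverage the abstract refers to.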
