Social Text Normalization using Contextual Graph Random Walks

We introduce a social media text normalization system that can be deployed as a preprocessing step for Machine Translation and various NLP applications to handle social media text. The proposed system is based on unsupervised learning of the normalization equivalences from unlabeled text. The proposed approach uses Random Walks on a contextual similarity bipartite graph constructed from n-gram sequences on large unlabeled text corpus. We show that the proposed approach has a very high precision of (92.43) and a reasonable recall of (56.4). When used as a preprocessing step for a state-of-the-art machine translation system, the translation quality on social media text improved by 6%. The proposed approach is domain and language independent and can be deployed as a preprocessing step for any NLP application to handle social media text.

[1]  D. Vere-Jones Markov Chains , 1972, Nature.

[2]  Justin Zobel,et al.  Phonetic string matching: lessons from information retrieval , 1996, SIGIR '96.

[3]  I. Dan Melamed,et al.  Bitext Maps and Alignment via Pattern Recognition , 1999, CL.

[4]  Eric Brill,et al.  An Improved Error Model for Noisy Channel Spelling Correction , 2000, ACL.

[5]  Tommi S. Jaakkola,et al.  Partially labeled classification with Markov random walks , 2001, NIPS.

[6]  Shankar Kumar,et al.  Normalization of non-standard words , 2001, Comput. Speech Lang..

[7]  Kristina Toutanova,et al.  Pronunciation Modeling for Improved Spelling Correction , 2002, ACL.

[8]  Salim Roukos,et al.  Bleu: a Method for Automatic Evaluation of Machine Translation , 2002, ACL.

[9]  Jian Su,et al.  A Phrase-Based Statistical Model for SMS Text Normalization , 2006, ACL.

[10]  Thad Hughes,et al.  Lexical Semantic Relatedness with Random Graph Walks , 2007, EMNLP.

[11]  Philipp Koehn,et al.  Moses: Open Source Toolkit for Statistical Machine Translation , 2007, ACL.

[12]  Animesh Mukherjee,et al.  Investigation and modeling of the structure of texting language , 2007, International Journal of Document Analysis and Recognition (IJDAR).

[13]  Suzanne Stevenson,et al.  An Unsupervised Model for Text Message Normalization , 2009 .

[14]  L. Venkata Subramaniam,et al.  Unsupervised cleansing of noisy text , 2010, COLING.

[15]  Xu Sun,et al.  Learning Phrase-Based Spelling Error Models from Clickthrough Data , 2010, ACL.

[16]  Ye-In Chang,et al.  A hash trie filter method for approximate string matching in genomic databases , 2010, Applied Intelligence.

[17]  Fei Liu,et al.  Insertion, Deletion, or Substitution? Normalizing Text Messages without Pre-categorization nor Supervision , 2011, ACL.

[18]  Slav Petrov,et al.  Unsupervised Part-of-Speech Tagging with Bilingual Graph-Based Projections , 2011, ACL.

[19]  Timothy Baldwin,et al.  Lexical Normalisation of Short Text Messages: Makn Sens a #twitter , 2011, ACL.

[20]  Eduard H. Hovy,et al.  Unsupervised Mining of Lexical Variants from Noisy Text , 2011, ULNLP@EMNLP.

[21]  William W. Cohen,et al.  Graph Based Similarity Measures for Synonym Extraction from Parsed Text , 2012, TextGraphs@ACL.

[22]  Timothy Baldwin,et al.  Automatically Constructing a Normalisation Dictionary for Microblogs , 2012, EMNLP.