Noisy Text Normalization Using an Enhanced Language Model

User-generated text on social networking sites contains an enormous number and a vast variety of out-of-vocabulary words, formed both deliberately and mistakenly by end users. It is therefore essential to normalize noisy text before applying NLP tasks to it. This paper describes an unsupervised normalization system that comprises two phases: candidate generation and candidate selection. We generate candidates via six different methods: 1) lexical generation at one edit distance, 2) phonemic generation, 3) a blend of the two previous methods, 4) lexical generation at two edit distances, 5) dictionary translation, and 6) heuristic rules. Although candidate selection uses a trigram language model, we present a new method that selects candidates with respect to all other words in the sentence. Our experiments on a large dataset show promising results.
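The two phases can be illustrated with a minimal sketch. It is not the system described in this paper. It covers only the first generation method (lexical candidates at one edit distance) and a simplified selection step that scores each candidate by a raw trigram count over its immediate neighbours, rather than against all other words in the sentence. The function names, the toy vocabulary, and the toy trigram counts are all hypothetical.

```python
import string


def one_edit_candidates(token, vocabulary, alphabet=string.ascii_lowercase):
    """Return in-vocabulary words one edit away from `token`.

    Edits considered: delete, transpose, substitute, insert (assumed set).
    """
    splits = [(token[:i], token[i:]) for i in range(len(token) + 1)]
    deletes = {a + b[1:] for a, b in splits if b}
    transposes = {a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1}
    substitutes = {a + c + b[1:] for a, b in splits if b for c in alphabet}
    inserts = {a + c + b for a, b in splits for c in alphabet}
    edits = deletes | transposes | substitutes | inserts
    # Keep only real words, and never propose the noisy token itself.
    return (edits & set(vocabulary)) - {token}


def select_candidate(left, candidates, right, trigram_counts):
    """Pick the candidate with the highest count for (left, candidate, right).

    Simplification: uses raw counts of one trigram, not a smoothed
    language model. Ties are broken alphabetically.
    """
    return max(sorted(candidates),
               key=lambda c: trigram_counts.get((left, c, right), 0))


# Toy data, invented for illustration only.
vocabulary = {"the", "then", "they", "see", "you", "tomorrow"}
trigram_counts = {("and", "then", "we"): 5, ("and", "the", "we"): 1}

candidates = one_edit_candidates("thn", vocabulary)
best = select_candidate("and", candidates, "we", trigram_counts)
print(sorted(candidates), best)
```

For the noisy token "thn" in the context "and thn we", the sketch proposes "the" and "then" and selects "then", because the toy counts favour that trigram. A real system would replace the raw counts with a smoothed language model and combine candidates from all six generation methods.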
