Lexical Normalisation of Short Text Messages: Makn Sens a #twitter

Twitter provides access to large volumes of data in real time, but the text is notoriously noisy, which hampers its utility for NLP. In this paper, we target out-of-vocabulary words in short text messages and propose a method for identifying and normalising ill-formed words. Our method uses a classifier to detect ill-formed words, and generates correction candidates based on morphophonemic similarity. Both word similarity and context are then exploited to select the most probable correction candidate for each word. The proposed method does not require any annotations, and achieves state-of-the-art performance over an SMS corpus and a novel dataset based on Twitter.
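The detect, generate, select pipeline described above can be illustrated with a minimal Python sketch. It is not the paper's implementation, and everything in it is an assumption made for illustration: the toy `VOCAB`, the helper names `phonetic_key`, `candidates` and `normalise`, the crude consonant-skeleton key standing in for morphophonemic similarity, and the 0.6 similarity threshold. Detection here is a plain dictionary lookup rather than a trained classifier, and the context-based selection step is omitted entirely.

```python
from difflib import SequenceMatcher

# Toy in-vocabulary word list (hypothetical; a real system would use a full lexicon).
VOCAB = {"making", "sense", "of", "twitter", "tomorrow", "see", "you", "today", "to"}


def phonetic_key(word):
    """Crude consonant skeleton: keep the first letter, drop later vowels,
    and collapse repeated consonants. A stand-in for a real phonetic code."""
    word = word.lower()
    key = word[:1]
    for ch in word[1:]:
        if ch in "aeiouy":
            continue
        if key and key[-1] == ch:
            continue
        key += ch
    return key


def similarity(a, b):
    """Character-level similarity in [0, 1] from the standard library."""
    return SequenceMatcher(None, a, b).ratio()


def candidates(token, vocab, threshold=0.6):
    """Vocabulary words that share the token's phonetic key or are
    close to it at the character level."""
    key = phonetic_key(token)
    return sorted(
        w for w in vocab
        if phonetic_key(w) == key or similarity(token, w) >= threshold
    )


def normalise(tokens, vocab):
    """Replace each out-of-vocabulary token with its most similar candidate.
    In-vocabulary tokens and tokens without candidates pass through unchanged."""
    out = []
    for tok in tokens:
        low = tok.lower()
        if low in vocab:
            out.append(tok)
            continue
        cands = candidates(low, vocab)
        if not cands:
            out.append(tok)
            continue
        out.append(max(cands, key=lambda w: similarity(low, w)))
    return out


print(normalise(["makn", "sens", "of", "twitter"], VOCAB))
```

Ranking by string similarity alone cannot separate candidates that are equally close to the ill-formed token, which is where the context evidence mentioned in the abstract would come in.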
