Language-independent unsupervised learning of short message service dialect

Noise in textual data, such as that introduced by multilinguality, misspellings, abbreviations, deletions, phonetic spellings, and non-standard transliteration, poses considerable problems for text mining. Such corruptions are very common in instant messenger and short message service (SMS) data and adversely affect off-the-shelf text mining methods. Most existing techniques address the problem with supervised methods that rely on hand-labeled corrections, but such labels and corrections are expensive and time consuming to obtain because of the multilinguality and complexity of the corruptions. While we do not champion unsupervised methods over supervised ones when quality of results is the sole concern, we demonstrate that unsupervised methods can provide cost-effective results without the expensive human intervention needed to build a parallel labeled corpus. We present a generative-model-based unsupervised technique that maps non-standard words to their corresponding conventional, frequent forms. A hidden Markov model (HMM) over a “subsequencized” representation of words is used, in which a word is represented as a bag of weighted subsequences. The approximate maximum likelihood inference algorithm is designed so that the training phase involves clustering over vectors rather than the customary and expensive dynamic programming over sequences (the Baum–Welch algorithm) that HMMs normally require. We propose a principled transformation of the maximum-likelihood-based “central clustering” cost function of Baum–Welch into a “pairwise similarity” based clustering. This transformation makes it possible to apply “subsequence kernel” based methods that model deletion and insertion corruptions well. The novelty of the approach lies in avoiding the expensive Baum–Welch iterations required for HMMs by approximating the log-likelihood function and establishing a connection between the log-likelihood and a pairwise distance. Anecdotal evidence of efficacy is provided on public and proprietary data.
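For intuition, the sketch below is a minimal illustration (not the paper's exact formulation) of the “subsequencized” representation and a subsequence-kernel similarity: each token is mapped to a bag of character subsequences weighted by a decay factor on the span they cover, so gapped matches caused by deletions and insertions still contribute, only with smaller weight. The helper names, the decay parameter `lam`, and the maximum subsequence length are assumptions introduced here for illustration.

```python
from itertools import combinations
from collections import defaultdict
import math

def bag_of_subsequences(word, max_len=3, lam=0.5):
    """Map a word to a bag of character subsequences of length <= max_len.
    Each occurrence is weighted by lam ** span, where span is the number of
    characters it covers (including gaps), so delete/insert-corrupted
    variants still share subsequences with the clean form."""
    bag = defaultdict(float)
    for k in range(1, max_len + 1):
        for idx in combinations(range(len(word)), k):
            span = idx[-1] - idx[0] + 1
            subseq = "".join(word[i] for i in idx)
            bag[subseq] += lam ** span
    return bag

def subsequence_similarity(w1, w2, max_len=3, lam=0.5):
    """Cosine-normalized inner product of the two bags: a simple
    subsequence-kernel similarity between tokens."""
    b1 = bag_of_subsequences(w1, max_len, lam)
    b2 = bag_of_subsequences(w2, max_len, lam)
    dot = sum(v * b2.get(s, 0.0) for s, v in b1.items())
    n1 = math.sqrt(sum(v * v for v in b1.values()))
    n2 = math.sqrt(sum(v * v for v in b2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

# Noisy SMS tokens score closer to their conventional form than to unrelated words:
print(subsequence_similarity("tomorrow", "tmrw"))    # relatively high
print(subsequence_similarity("tomorrow", "thanks"))  # relatively low
```

Pairwise similarities of this kind are what the proposed transformation exploits: instead of Baum–Welch iterations over sequences, clustering over such pairwise scores groups noisy variants with their conventional, frequent form.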
