Effective Spell Checking Methods Using Clustering Algorithms

This paper presents a novel approach to spell checking based on dictionary clustering. The main goal is to reduce the number of distance calculations needed to find the target word for a misspelling. The method is unsupervised and combines anomalous pattern initialization with partitioning around medoids (PAM). To evaluate the method, we used a list of English misspellings compiled from real examples in the Birkbeck spelling error corpus.
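The idea of saving distance calculations through clustering can be illustrated with a minimal sketch. It is not the paper's implementation: the toy dictionary, the hand-picked medoids, and the helper names (`levenshtein`, `assign_to_medoids`, `suggest`) are assumptions made for illustration, and the medoids here are fixed by hand rather than produced by anomalous pattern initialization and PAM. The sketch compares a misspelling with the medoids first and then only with the members of the nearest cluster.

```python
from collections import defaultdict


def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def assign_to_medoids(words, medoids):
    """Group each dictionary word under its nearest medoid."""
    clusters = defaultdict(list)
    for word in words:
        nearest = min(medoids, key=lambda m: levenshtein(word, m))
        clusters[nearest].append(word)
    return dict(clusters)


def suggest(misspelling, clusters):
    """Return (closest word in the nearest cluster, distance calls made)."""
    medoids = list(clusters)
    nearest = min(medoids, key=lambda m: levenshtein(misspelling, m))
    members = clusters[nearest]
    best = min(members, key=lambda w: levenshtein(misspelling, w))
    return best, len(medoids) + len(members)


# Hypothetical toy dictionary and hand-picked medoids (assumptions).
dictionary = ["cat", "cot", "cut", "house", "horse", "mouse",
              "spelling", "smelling", "selling"]
medoids = ["cat", "house", "spelling"]

clusters = assign_to_medoids(dictionary, medoids)
word, calls = suggest("housse", clusters)
print(word, calls, "distance calls versus", len(dictionary), "for a full scan")
```

In this toy case the lookup makes 6 distance calls instead of the 9 a full dictionary scan would need. Searching only the nearest cluster is a simplification: a target word that was assigned to a different cluster would be missed, so a practical system would need some way to handle that case.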
