A generalized hidden Markov model with discriminative training for query spelling correction

Query spelling correction is a crucial component of modern search engines. Existing methods in the literature for search query spelling correction have two major drawbacks. First, they are unable to handle certain important types of spelling errors, such as concatenation and splitting. Second, they cannot efficiently evaluate all the candidate corrections due to the complex form of their scoring functions, and a heuristic filtering step must be applied to select a working set of top-K most promising candidates for final scoring, leading to non-optimal predictions. In this paper we address both limitations and propose a novel generalized Hidden Markov Model with discriminative training that can not only handle all the major types of spelling errors, including splitting and concatenation errors, in a single unified framework, but also efficiently evaluate all the candidate corrections to ensure the finding of a globally optimal correction. Experiments on two query spelling correction datasets demonstrate that the proposed generalized HMM is effective for correcting multiple types of spelling errors. The results also show that it significantly outperforms the current approach for generating top-K candidate corrections, making it a better first-stage filter to enable any other complex spelling correction algorithm to have access to a better working set of candidate corrections as well as to cover splitting and concatenation errors, which no existing method in academic literature can correct.

[1]  Biing-Hwang Juang,et al.  Hidden Markov Models for Speech Recognition , 1991 .

[2]  Michael Collins,et al.  Discriminative Training Methods for Hidden Markov Models: Theory and Experiments with Perceptron Algorithms , 2002, EMNLP.

[3]  Andrew McCallum,et al.  Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data , 2001, ICML.

[4]  Hang Li,et al.  A unified and discriminative model for query refinement , 2008, SIGIR '08.

[5]  Kuansan Wang,et al.  Web scale NLP: a case study on url word breaking , 2011, WWW.

[6]  Xu Sun,et al.  Learning Phrase-Based Spelling Error Models from Clickthrough Data , 2010, ACL.

[7]  Huizhong Duan,et al.  Online spelling correction for query completion , 2011, WWW.

[8]  Ming Zhou,et al.  Improving Query Spelling Correction Using Web Search Results , 2007, EMNLP-CoNLL.

[9]  Xu Sun,et al.  A Large Scale Ranker-Based System for Search Query Spelling Correction , 2010, COLING.

[10]  Eric Brill,et al.  An Improved Error Model for Noisy Channel Spelling Correction , 2000, ACL.

[11]  Hermann Ney,et al.  HMM-Based Word Alignment in Statistical Translation , 1996, COLING.

[12]  Chris Buckley,et al.  Improving automatic query expansion , 1998, SIGIR '98.

[13]  ChengXiang Zhai,et al.  Mining term association patterns from search logs for effective query reformulation , 2008, CIKM '08.

[14]  W. Bruce Croft,et al.  Query expansion using local and global document analysis , 1996, SIGIR '96.

[15]  Fuchun Peng,et al.  Unsupervised query segmentation using generative language models and wikipedia , 2008, WWW.

[16]  Ben Hutchinson,et al.  Using the Web for Language Independent Spellchecking and Autocorrection , 2009, EMNLP.

[17]  Farooq Ahmad,et al.  Learning a Spelling Error Model from Search Query Logs , 2005, HLT.

[18]  Vladimir I. Levenshtein,et al.  Binary codes capable of correcting deletions, insertions, and reversals , 1965 .

[19]  Karen Kukich,et al.  Techniques for automatically correcting words in text , 1992, CSUR.

[20]  Zhendong Niu,et al.  Concept Based Query Expansion , 2013, 2013 Ninth International Conference on Semantics, Knowledge and Grids.

[21]  Eric Brill,et al.  Spelling Correction as an Iterative Process that Exploits the Collective Knowledge of Web Users , 2004, EMNLP.

[22]  M. J. D. Powell,et al.  An efficient method for finding the minimum of a function of several variables without calculating derivatives , 1964, Comput. J..

[23]  Hans-Peter Frei,et al.  Concept based query expansion , 1993, SIGIR.

[24]  ChengXiang Zhai,et al.  Unsupervised query segmentation using clickthrough for information retrieval , 2011, SIGIR '11.

[25]  Lawrence R. Rabiner,et al.  A tutorial on hidden Markov models and selected applications in speech recognition , 1989, Proc. IEEE.