Spelling correction in the PubMed search engine

Users of internet search engines often enter queries in which one or more search terms are misspelled. Several web search engines suggest corrections for misspelled words, but to our knowledge the methods they use are proprietary and unpublished. Here we describe the methodology we have developed to perform spelling correction for the PubMed search engine. Our approach is based on the noisy channel model for spelling correction and uses statistics harvested from user logs to estimate the probabilities of the different types of edits that lead to misspellings. We discuss the problems unique to correcting search engine queries and outline our solutions.
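
To make the noisy channel idea concrete, here is a minimal Python sketch. It is not the PubMed implementation: the term counts, the per-edit probabilities, and the names `TERM_COUNTS`, `EDIT_PROB`, `candidates`, and `correct` are all hypothetical, chosen only for illustration. The sketch considers candidates a single edit away from the typed term and picks the one maximizing P(candidate) * P(typed | candidate), with one probability per edit type standing in for statistics that would be estimated from user logs.

```python
import math
from collections import Counter

# Hypothetical term counts standing in for frequencies from an index or query log.
TERM_COUNTS = Counter({"cancer": 900, "career": 50, "cancel": 30, "dancer": 20})

# Hypothetical probability of each kind of user error (assumed, not measured).
EDIT_PROB = {"insert": 0.002, "delete": 0.004, "substitute": 0.003, "transpose": 0.001}

ALPHABET = "abcdefghijklmnopqrstuvwxyz"


def candidates(typed):
    """Yield (candidate, error_type) pairs one edit away from the typed term.

    error_type names the mistake the user would have made when turning the
    candidate into the typed term, so undoing an extra letter is an "insert".
    """
    for i in range(len(typed) + 1):
        left, right = typed[:i], typed[i:]
        if right:
            yield left + right[1:], "insert"  # user typed an extra letter
        if len(right) > 1:
            yield left + right[1] + right[0] + right[2:], "transpose"
        for ch in ALPHABET:
            yield left + ch + right, "delete"  # user dropped a letter
            if right and ch != right[0]:
                yield left + ch + right[1:], "substitute"


def correct(typed):
    """Return the known term maximizing log P(term) + log P(typed | term)."""
    if typed in TERM_COUNTS:
        return typed  # already a known term, leave it alone
    total = sum(TERM_COUNTS.values())
    best, best_score = typed, -math.inf
    for cand, error_type in candidates(typed):
        if cand not in TERM_COUNTS:
            continue
        score = math.log(TERM_COUNTS[cand] / total) + math.log(EDIT_PROB[error_type])
        if score > best_score:
            best, best_score = cand, score
    return best


print(correct("cancr"))
print(correct("cacner"))
```

With these toy numbers, `correct("cancee")` has two known candidates, "cancer" and "cancel", both reached by a substitution, and the more frequent term wins. A term with no known candidate within one edit is returned unchanged.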
