Matching health information seekers' queries to medical terms

Background: The Internet is a major source of health information, but most seekers are not familiar with medical vocabularies, so their searches often fail because the query is poorly formulated. Several methods have been proposed to improve information retrieval: query expansion, syntactic and semantic techniques, and knowledge-based methods. It is also useful, however, to clean up queries that are misspelled. In this paper, we propose a simple yet efficient method to correct misspellings in queries submitted by health information seekers to an online medical search tool.

Methods: In addition to query normalizations and exact phonetic term matching, we tested two approximate string comparators: the similarity score function of Stoilos and the normalized Levenshtein edit distance. We propose combining them to increase the number of matched medical terms in French. In a first run, we used a sample of query logs to determine the thresholds and processing times. In a second run, at a larger scale, we tested different combinations of query normalizations applied before or after misspelling correction, using the thresholds retained from the first run.

Results: Relative to the total number of suggestions (around 163, the number of queries in the first sample), the normalized Levenshtein edit distance gave its highest F-Measure (88.15%) at a threshold comparator score of 0.3, and the Stoilos function gave its highest F-Measure (84.31%) at a threshold comparator score of 0.7. When Levenshtein and Stoilos are combined, the highest F-Measure (80.28%) is obtained with thresholds of 0.2 and 0.7 respectively. However, queries are often composed of several terms that may themselves be combinations of medical terms, so a process of query normalization and segmentation is required. The highest F-Measure (64.18%) is obtained when this process is carried out before spelling correction.

Conclusions: Despite the widely known high performance of the normalized Levenshtein edit distance, we show in this paper that combining it with the Stoilos algorithm improved the results for misspelling correction of user queries. Accuracy is improved by combining spelling information, phoneme-based information, and string normalization and segmentation into medical terms. These encouraging results have enabled the integration of this method into two projects funded by the French National Research Agency's Technologies for Health Care program. The first aims to facilitate the coding of clinical free text contained in Electronic Health Records and discharge summaries, whereas the second aims to improve information retrieval through Electronic Health Records.
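
The combination of the two comparators described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and several points are assumptions made for illustration: the edit distance is normalized by the length of the longer string, the Stoilos score is a simplified variant (commonality minus difference, with the Winkler prefix bonus omitted and p = 0.6), a candidate is kept only when both comparators accept it, and normalization is reduced to lowercasing and accent stripping. The function names and the tiny term list are hypothetical; the thresholds 0.2 and 0.7 are the combined values reported in the abstract.

```python
import unicodedata


def normalize(text: str) -> str:
    """Lowercase and strip accents (an assumed, minimal normalization)."""
    decomposed = unicodedata.normalize("NFKD", text.lower())
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def normalized_levenshtein(a: str, b: str) -> float:
    """Distance in [0, 1]; dividing by the longer length is an assumption."""
    if not a and not b:
        return 0.0
    return levenshtein(a, b) / max(len(a), len(b))


def longest_common_substring(a: str, b: str):
    """Return (start_in_a, start_in_b, length) of the first longest match."""
    best = (0, 0, 0)
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                cur[j] = prev[j - 1] + 1
                if cur[j] > best[2]:
                    best = (i - cur[j], j - cur[j], cur[j])
        prev = cur
    return best


def stoilos_similarity(a: str, b: str, p: float = 0.6) -> float:
    """Simplified Stoilos score in [-1, 1]: commonality minus difference."""
    if not a or not b:
        return 1.0 if a == b else 0.0  # convention chosen for this sketch
    rest_a, rest_b, common = a, b, 0
    while True:
        ia, ib, n = longest_common_substring(rest_a, rest_b)
        if n == 0:
            break
        common += n
        # Remove the matched substring from both strings and repeat.
        rest_a = rest_a[:ia] + rest_a[ia + n:]
        rest_b = rest_b[:ib] + rest_b[ib + n:]
    comm = 2 * common / (len(a) + len(b))
    ua, ub = len(rest_a) / len(a), len(rest_b) / len(b)
    diff = (ua * ub) / (p + (1 - p) * (ua + ub - ua * ub))
    return comm - diff


def suggest(query: str, terms, lev_max: float = 0.2, sto_min: float = 0.7):
    """Keep terms accepted by BOTH comparators (assumed combination rule)."""
    q = normalize(query)
    scored = []
    for term in terms:
        t = normalize(term)
        distance = normalized_levenshtein(q, t)
        if distance <= lev_max and stoilos_similarity(q, t) >= sto_min:
            scored.append((distance, term))
    return [term for _, term in sorted(scored)]


if __name__ == "__main__":
    print(suggest("Diabètte", ["diabète", "asthme", "diarrhée"]))
```

In this sketch a lower Levenshtein score and a higher Stoilos score both mean a closer match, which is why the two thresholds are applied in opposite directions. Requiring both tests to pass favors precision over recall; accepting a term when either test passes would be the opposite trade-off.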
