Concept Extraction and Synonymy Management for Biomedical Information Retrieval

This paper reports on work done for the Genomics Track at TREC 2004 by ConverSpeech LLC in conjunction with scientists at the Saccharomyces Genome Database (SGD), the model organism database located at Stanford University, California. The rapidly increasing number of articles in the biomedical literature has created new urgency for software tools that find information relevant to specific information needs. We focused on two challenges in this work: the problems of synonymy (several terms having the same meaning) and polysemy (a term having more than one meaning), and the problem of constructing queries from information needs stated in natural language. We investigated the use of concept extraction for the second problem, relying on the limited statements of information need as the source text for analysis. To reduce the problem of synonymy, we investigated the use of a language-oriented biomedical ontology and MeSH (Medical Subject Headings) for term expansion. Additionally, to reduce the problem of polysemy, we used extracted concepts to analyze and rank the documents returned by a search. We submitted two sets of results to TREC for evaluation: the first was produced automatically, and the second was derived from the first by making specific kinds of changes to the query and ranking methods. The mean average precision (MAP) for the automatic result was lower than the median of the 37 submitted runs overall; however, desirable results were obtained for mean average precision at 10 and 100 documents for almost half the topics. The MAP for the derived result was higher than the median.
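
For readers unfamiliar with the evaluation measure, the following is a minimal sketch of how mean average precision is conventionally computed from ranked results and relevance judgments. It is not the TREC evaluation software; the function names, the dictionary layout, and the document and topic identifiers are hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Set


def average_precision(ranked: List[str], relevant: Set[str]) -> float:
    """Average precision for one topic.

    `ranked` is the list of returned document ids, best first.
    `relevant` is the set of document ids judged relevant for the topic.
    """
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
            # Precision at the rank where a relevant document appears.
            precision_sum += hits / rank
    # Relevant documents that were never retrieved contribute zero.
    return precision_sum / len(relevant)


def mean_average_precision(
    runs: Dict[str, List[str]], judgments: Dict[str, Set[str]]
) -> float:
    """Mean of the per-topic average precision over all judged topics."""
    if not judgments:
        return 0.0
    scores = [
        average_precision(runs.get(topic, []), relevant)
        for topic, relevant in judgments.items()
    ]
    return sum(scores) / len(scores)


# Hypothetical toy data: two topics with made-up document ids.
example_runs = {"t1": ["d1", "d2", "d3", "d4"], "t2": ["a", "b"]}
example_judgments = {"t1": {"d1", "d3"}, "t2": {"b"}}
example_map = mean_average_precision(example_runs, example_judgments)
```

Because each relevant document is credited with the precision at its own rank, the measure rewards rankings that place relevant documents early, which is why changes to the ranking method can raise MAP even when the same documents are retrieved.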