On the use of Clustering and the MeSH Controlled Vocabulary to Improve MEDLINE Abstract Search

Databases of genomic documents contain substantial amounts of structured information in addition to the text of titles and abstracts. Unstructured information retrieval techniques fail to take advantage of this structured information. This paper describes a technique that improves on traditional retrieval methods by partitioning the retrieval result set into two distinct clusters using the additional structured information. Our hypothesis is that the relevant documents are to be found in the tighter of the two clusters, as suggested by van Rijsbergen's cluster hypothesis. We present an experimental evaluation of these ideas based on the relevance judgments of the Genomics track of the 2004 TREC workshop, using the CLUTO clustering software package.
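The two-cluster idea can be illustrated with a minimal sketch. It is not the CLUTO-based procedure evaluated in the paper: it assumes each retrieved document is represented as a plain set of MeSH terms, uses a simple two-centroid k-means with cosine similarity as a stand-in for CLUTO, and measures "tightness" as average pairwise cosine similarity. All function names and the sample documents are hypothetical.

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two documents given as sets of terms."""
    if not a or not b:
        return 0.0
    return len(a & b) / sqrt(len(a) * len(b))

def centroid(docs):
    """Mean binary term vector of a group of term sets, as a dict."""
    weights = {}
    for d in docs:
        for t in d:
            weights[t] = weights.get(t, 0.0) + 1.0 / len(docs)
    return weights

def sim_to_centroid(doc, cen):
    """Cosine similarity between a term set and a weighted centroid."""
    norm = sqrt(sum(w * w for w in cen.values()))
    if not doc or norm == 0.0:
        return 0.0
    return sum(cen.get(t, 0.0) for t in doc) / (sqrt(len(doc)) * norm)

def two_way_cluster(docs, iterations=10):
    """Assumed stand-in for CLUTO: k-means with k=2 on term sets."""
    # Deterministic seeds: the first document and the one least similar to it.
    far = min(range(len(docs)), key=lambda i: cosine(docs[0], docs[i]))
    cents = [centroid([docs[0]]), centroid([docs[far]])]
    labels = None
    for _ in range(iterations):
        new_labels = [
            0 if sim_to_centroid(d, cents[0]) >= sim_to_centroid(d, cents[1]) else 1
            for d in docs
        ]
        if new_labels == labels:
            break
        labels = new_labels
        for k in (0, 1):
            members = [d for d, l in zip(docs, labels) if l == k]
            if members:
                cents[k] = centroid(members)
    return labels

def tightness(members):
    """Assumed tightness measure: average pairwise cosine similarity."""
    n = len(members)
    if n < 2:
        return 0.0
    total = sum(cosine(members[i], members[j])
                for i in range(n) for j in range(i + 1, n))
    return total / (n * (n - 1) / 2)

def tightest_cluster(docs):
    """Return the indices of the documents in the tighter of two clusters."""
    labels = two_way_cluster(docs)
    groups = [[i for i, l in enumerate(labels) if l == k] for k in (0, 1)]
    return max(groups, key=lambda g: tightness([docs[i] for i in g]))

# Hypothetical result set: each document is a set of MeSH terms.
docs = [
    {"Humans", "Neoplasms", "Genes, p53"},
    {"Humans", "Neoplasms", "Genes, p53", "Mutation"},
    {"Neoplasms", "Genes, p53", "Mutation"},
    {"Animals", "Mice", "Diet"},
    {"Animals", "Rats", "Diet"},
]
print(tightest_cluster(docs))  # → [0, 1, 2]
```

Under the hypothesis stated above, the documents returned by `tightest_cluster` would be treated as the more likely relevant ones. The sketch expects at least two documents; the choice of similarity and tightness measures is illustrative only.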
