Ontology Based Clustering for Improving Genomic IR

Recent work has shown that ontologies can improve information retrieval performance, especially for biomedical literature, and that ontology-based methods can address the synonymy problem. In this paper, we propose a new framework for genomic information retrieval based on the Unified Medical Language System (UMLS). The framework comprises three steps. First, documents are indexed with UMLS, so that each document is represented by concepts, and each concept weight is recalculated using the similarity between concepts. Second, the documents are clustered with the fuzzy c-means method. Finally, a cluster language model is used for retrieval. The method partly solves the synonymy and polysemy problems. We evaluate it on the TREC 2004/05 Genomics Track collections. Experiments show that it greatly improves retrieval performance compared with the basic language model.
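To make the clustering step concrete, the following is a minimal sketch of standard fuzzy c-means applied to concept-weight vectors. It is not taken from the paper: the function name `fuzzy_c_means`, the toy vectors in `docs`, the fuzzifier m = 2, the Euclidean distance, and the choice of initial centers are all assumptions made for illustration, and the paper's own weighting and parameters may differ.

```python
import math


def _distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def _memberships(points, centers, m):
    """Soft membership of every point in every cluster (each row sums to 1)."""
    exponent = 2.0 / (m - 1.0)
    rows = []
    for x in points:
        dists = [_distance(x, c) for c in centers]
        zeros = [i for i, d in enumerate(dists) if d == 0.0]
        if zeros:
            # A point sitting exactly on a center belongs fully to that center.
            row = [1.0 / len(zeros) if i in zeros else 0.0
                   for i in range(len(centers))]
        else:
            row = [1.0 / sum((d_i / d_j) ** exponent for d_j in dists)
                   for d_i in dists]
        rows.append(row)
    return rows


def fuzzy_c_means(points, init_centers, m=2.0, iters=50):
    """Standard fuzzy c-means; returns (memberships, centers)."""
    centers = [list(c) for c in init_centers]
    dims = len(points[0])
    for _ in range(iters):
        u = _memberships(points, centers, m)
        for i in range(len(centers)):
            weights = [row[i] ** m for row in u]
            total = sum(weights)
            if total > 0.0:
                # New center: membership-weighted mean of all points.
                centers[i] = [
                    sum(w * x[d] for w, x in zip(weights, points)) / total
                    for d in range(dims)
                ]
    return _memberships(points, centers, m), centers


# Hypothetical concept-weight vectors: 4 documents over 3 concepts.
docs = [
    [1.0, 0.9, 0.0],
    [0.9, 1.0, 0.1],
    [0.0, 0.1, 1.0],
    [0.1, 0.0, 0.9],
]
memberships, centers = fuzzy_c_means(docs, init_centers=[docs[0], docs[2]])
hard_labels = [row.index(max(row)) for row in memberships]
```

Unlike hard clustering, each document keeps a degree of membership in every cluster, so a document that touches several topics can contribute to more than one cluster model in the retrieval step. `hard_labels` is computed here only to inspect the dominant cluster of each toy document.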
