A Comparative Study of Ontology Based Term Similarity Measures on PubMed Document Clustering

Recent research shows that an ontology used as background knowledge can improve document clustering quality through its concept hierarchy. Previous studies use term semantic similarity as an important measure for incorporating domain knowledge into steps of the clustering process such as clustering initialization and term re-weighting. However, few studies have examined how different types of term similarity measures affect clustering performance in a given domain. In this paper, we conduct a comparative study of how different term semantic similarity measures, including path based, information content based, and feature based measures, affect document clustering. We evaluate term re-weighting as an important method for integrating a domain ontology into the clustering process. We apply k-means clustering to one real-world text dataset, our own corpus generated from PubMed. Experimental results on eight semantic similarity measures show that: (1) no single type of similarity measure significantly outperforms the others; (2) several similarity measures perform more stably than the others; (3) term re-weighting has a positive effect on medical document clustering, but the effect might not be significant when documents contain few terms.
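To make the idea of term re-weighting with a path based measure concrete, the following is a minimal sketch and not the method evaluated in the paper. It assumes a toy is-a hierarchy (the `PARENT` table is hypothetical and not drawn from any real ontology), takes similarity to be 1 / (1 + shortest path length), and re-weights each term by adding the similarity-scaled weights of the other terms in the same document. The function names and the `threshold` parameter are illustrative assumptions.

```python
# Toy is-a hierarchy (hypothetical, for illustration only): child -> parent.
PARENT = {
    "disease": None,
    "infection": "disease",
    "neoplasm": "disease",
    "pneumonia": "infection",
    "influenza": "infection",
    "carcinoma": "neoplasm",
}


def ancestors(term):
    """Return the chain from term up to the root, starting with term itself."""
    chain = []
    while term is not None:
        chain.append(term)
        term = PARENT[term]
    return chain


def path_length(a, b):
    """Count the edges on the shortest path between a and b in the tree."""
    steps_from_b = {t: i for i, t in enumerate(ancestors(b))}
    for steps_from_a, t in enumerate(ancestors(a)):
        if t in steps_from_b:
            # t is the lowest common ancestor of a and b.
            return steps_from_a + steps_from_b[t]
    raise ValueError("terms do not share a root")


def path_similarity(a, b):
    """Assumed path based similarity: 1 / (1 + shortest path length)."""
    return 1.0 / (1.0 + path_length(a, b))


def reweight(doc, threshold=0.0):
    """Re-weight a document given as a dict mapping term -> weight.

    Each term gains the weights of the other terms in the document,
    scaled by similarity; pairs at or below the threshold are ignored.
    """
    new_doc = {}
    for term, weight in doc.items():
        boost = 0.0
        for other, other_weight in doc.items():
            if other == term:
                continue
            sim = path_similarity(term, other)
            if sim > threshold:
                boost += sim * other_weight
        new_doc[term] = weight + boost
    return new_doc


doc = {"pneumonia": 1.0, "influenza": 1.0, "carcinoma": 1.0}
reweighted = reweight(doc)
```

In this sketch, terms that sit close to many other terms of the document in the hierarchy receive larger weights. The re-weighted vectors would then be passed to a clustering algorithm such as k-means in place of the original term weights. A document with very few terms offers few pairs to contribute a boost, which is consistent with finding (3) above.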
