Semantic frameworks for document and ontology clustering
The Internet has made it possible, in principle, for scientists to quickly find research papers of interest. In practice, the overwhelming volume of publications makes this a time-consuming task. It is therefore important to develop efficient ways to identify related publications. Clustering, a technique used in many fields, is one way to facilitate this. Ontologies can also help with the problem of finding related entities, including research publications. However, the development of new clustering methods has focused mainly on the algorithm per se, with less emphasis on feature selection and similarity measures, both of which can significantly affect the accuracy and runtime of clustering. In addition, to fully realize the high-resolution searches that ontologies can make possible, an important first step is to find automatic ways to cluster related ontologies. The major contribution of this dissertation is an innovative semantic framework for document clustering, called Citonomy, a dynamic approach that (1) exploits the citation semantics of scientific documents, (2) deals with evolving datasets of documents, and (3) addresses the interplay between algorithms, feature selection, and similarity measures in an integrated manner. This improves accuracy and runtime performance over existing clustering algorithms. As the first step in Citonomy, we propose a new approach to extract and build a model for citation semantics. Both subjective and objective evaluations demonstrate the effectiveness of this model in extracting citation semantics. For the clustering stage, the Citonomy framework offers two approaches: (1) CS-VS: Combining Citation Semantics and VSM (Vector Space Model) Measures and (2) CS2CS: From Citation Semantics to Cluster Semantics. CS2CS is a document clustering algorithm with a 3-level feature selection process.
CS2CS improves on CS-VS in several respects: (i) it eliminates the need for a training step, (ii) it introduces an advanced feature selection mechanism, and (iii) it supports dynamic and adaptive clustering of new datasets. Compared with traditional document clustering, CS-VS and CS2CS significantly improve clustering accuracy, by 5-15% on average in terms of the F-Measure. CS2CS is a linear clustering algorithm that is faster than the common document clustering algorithms K-Means and K-Medoids. In addition, it overcomes a major drawback of K-Means and K-Medoids, which require the number of clusters to be fixed in advance: in CS2CS the number of clusters is determined dynamically by splitting and merging clusters. Fuzzy clustering with this approach has also been investigated. The related problem of ontology clustering is also addressed in this dissertation. A second semantic framework, InterOBO, has been designed for ontology clustering, and a prototype demonstrating its potential use has been developed. The Open Biomedical Ontologies (OBOs) are used as a case study to illustrate the clustering technique used to identify common concepts and links. Detailed experimental results on different data sets are given to show the merits of the proposed clustering algorithms.
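For readers unfamiliar with the F-Measure as a clustering accuracy score, the following is a minimal sketch of one common formulation: each reference class is matched to the cluster that gives it the best F-score, and those scores are averaged with weights proportional to class size. The abstract does not say which variant the dissertation uses, so this formulation is an assumption, and the function name `clustering_f_measure` and its arguments are hypothetical.

```python
from collections import Counter


def clustering_f_measure(true_labels, cluster_ids):
    """Class-weighted clustering F-measure (assumed variant, not taken from the source).

    true_labels: reference class of each document.
    cluster_ids: cluster assigned to each document, in the same order.
    """
    if len(true_labels) != len(cluster_ids):
        raise ValueError("true_labels and cluster_ids must have the same length")
    n = len(true_labels)
    if n == 0:
        raise ValueError("at least one document is required")

    class_sizes = Counter(true_labels)               # n_i: documents per class
    cluster_sizes = Counter(cluster_ids)             # n_j: documents per cluster
    overlap = Counter(zip(true_labels, cluster_ids)) # n_ij: shared documents

    total = 0.0
    for cls, n_i in class_sizes.items():
        best = 0.0
        for clu, n_j in cluster_sizes.items():
            n_ij = overlap.get((cls, clu), 0)
            if n_ij == 0:
                continue
            precision = n_ij / n_j
            recall = n_ij / n_i
            f_score = 2 * precision * recall / (precision + recall)
            best = max(best, f_score)
        # Weight each class by its share of the whole collection.
        total += (n_i / n) * best
    return total


if __name__ == "__main__":
    labels = ["a", "a", "b", "b"]
    print(clustering_f_measure(labels, [0, 0, 1, 1]))  # clusters match classes
    print(clustering_f_measure(labels, [0, 0, 0, 0]))  # everything in one cluster
```

Under this formulation a clustering that reproduces the reference classes exactly scores 1.0, while merging everything into one cluster is penalized through lower precision.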