Enhancing search result clustering with semantic indexing

Semantic clustering of search results is one of the most sought-after features of information retrieval systems, including general web search engines as well as domain-specific article portals and digital libraries. It can help users describe their information need more precisely. In this paper, we discuss a framework for extending document descriptions that utilizes domain knowledge and semantic similarity. Our idea is based on applying the Tolerance Rough Set Model, together with semantic information extracted from the source text and a domain ontology, to approximate the concepts associated with documents and to enrich their vector representation. We consider several document representation models that incorporate document metadata, citations, and semantic information built using the MeSH ontology. We compare these models on a search result clustering problem over freely accessible biomedical research articles from the PubMed Central (PMC) portal. The experimental results show the advantages of the proposed models.
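
To make the enrichment idea concrete, the following is a minimal sketch of how a Tolerance Rough Set Model can extend a document's term set, assuming the standard formulation from the rough set clustering literature: the tolerance class of a term contains the terms that co-occur with it in at least theta documents, and the upper approximation of a document contains every term whose tolerance class overlaps the document. The function names, the toy corpus, and the threshold value are illustrative assumptions, not the framework evaluated in this paper, which additionally uses citations and MeSH concepts.

```python
from collections import defaultdict
from itertools import combinations


def tolerance_classes(docs, theta):
    """Map each term to the terms co-occurring with it in >= theta documents.

    `docs` is a list of term sets. Every term belongs to its own class.
    """
    cooccurrence = defaultdict(int)
    vocabulary = set()
    for terms in docs:
        vocabulary |= terms
        # Count each unordered pair of distinct terms once per document.
        for a, b in combinations(sorted(terms), 2):
            cooccurrence[(a, b)] += 1

    classes = {term: {term} for term in vocabulary}
    for (a, b), count in cooccurrence.items():
        if count >= theta:
            classes[a].add(b)
            classes[b].add(a)
    return classes


def upper_approximation(doc, classes):
    """Return all terms whose tolerance class shares a term with `doc`."""
    return {term for term, cls in classes.items() if cls & doc}


# Hypothetical toy corpus, for illustration only.
docs = [
    {"rough", "set", "clustering"},
    {"rough", "set", "tolerance"},
    {"clustering", "search"},
]
classes = tolerance_classes(docs, theta=2)

# "rough" and "set" co-occur in two documents, so a document that
# mentions only "rough" is enriched with "set".
enriched = upper_approximation({"rough", "clustering"}, classes)
```

In this sketch the enriched term set would then be weighted (for example with TF-IDF) to form the extended vector representation; that weighting step is omitted here.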
