A Tolerance Rough Set Approach to Clustering Web Search Results

The two most popular ways to search for information on the web are web search engines and web directories. Although search engines keep improving, searching the web can still be tedious and time-consuming because of the web's huge size and highly dynamic nature. Moreover, the user's "intention behind the search" is often not clearly expressed, which results in short, overly general queries. The results returned by a search engine can number from hundreds to hundreds of thousands of documents. One approach to managing such a large number of results is clustering. Search results clustering can be defined as the process of automatically grouping search results into thematic groups. In contrast to traditional document clustering, however, search results are clustered on the fly (per user query) and locally, on the limited set of results returned by the search engine. Clustering search results can help users navigate large sets of documents more efficiently: concise, accurate cluster descriptions let users locate documents of interest faster.

In this paper, we propose an approach to search results clustering based on the Tolerance Rough Set model, following earlier work on document clustering [4,3]. Tolerance classes are used to approximate the concepts present in documents. The Tolerance Rough Set model was proposed for document clustering as a way to enrich document and cluster representations, with the aim of improving clustering performance.
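The abstract does not spell out how tolerance classes are built, so the following is only a minimal sketch under stated assumptions. It follows a formulation commonly used in the Tolerance Rough Set document clustering literature: the tolerance class of a term contains the term itself plus every term that co-occurs with it in at least theta documents, and the upper approximation of a document is the set of terms whose tolerance class overlaps the document. The function names, the threshold value, and the toy corpus are illustrative and not taken from the paper.

```python
from collections import defaultdict
from itertools import combinations


def tolerance_classes(docs, theta):
    """Map each term to itself plus all terms co-occurring with it in >= theta docs."""
    cooc = defaultdict(int)
    vocab = set()
    for doc in docs:
        terms = set(doc)
        vocab |= terms
        # Count each unordered term pair once per document.
        for a, b in combinations(sorted(terms), 2):
            cooc[(a, b)] += 1
    classes = {t: {t} for t in vocab}  # every term tolerates itself
    for (a, b), n in cooc.items():
        if n >= theta:
            classes[a].add(b)
            classes[b].add(a)
    return classes


def upper_approximation(doc, classes):
    """Return all terms whose tolerance class shares at least one term with doc."""
    terms = set(doc)
    return {t for t, cls in classes.items() if cls & terms}


# Toy corpus (hypothetical): each document is a list of already-stemmed terms.
docs = [
    ["rough", "set", "clustering"],
    ["rough", "set", "tolerance"],
    ["web", "search", "clustering"],
    ["web", "search", "engine"],
]
classes = tolerance_classes(docs, theta=2)
enriched = upper_approximation(["rough", "engine"], classes)
print(sorted(enriched))  # ['engine', 'rough', 'set']
```

Here the short document {rough, engine} gains the term "set" because "rough" and "set" co-occur in two documents. This is the sense in which the representation is enriched: two snippets that share no literal terms can still overlap after approximation.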

[1]  G. Karypis,et al.  Criterion functions for document clustering , 2005 .

[2]  Andrzej Skowron,et al.  Tolerance Approximation Spaces , 1996, Fundam. Informaticae.

[3]  Amanda Spink,et al.  Real life information retrieval: a study of user queries on the Web , 1998, SIGF.

[4]  Wojciech Ziarko,et al.  Variable Precision Rough Set Model , 1993, J. Comput. Syst. Sci.

[5]  G. Karypis,et al.  Criterion Functions for Document Clustering: Experiments and Analysis , 2001 .

[6]  Gerard Salton,et al.  Research and Development in Information Retrieval , 1982, Lecture Notes in Computer Science.

[7]  Frank A. Smadja,et al.  From N-Grams to Collocations: An Evaluation of Xtract , 1991, ACL.

[8]  W. Bruce Croft,et al.  Query expansion using local and global document analysis , 1996, SIGIR '96.

[9]  J. Stefanowski,et al.  A Hierarchical WWW Pages Clustering Algorithm Based on the Vector Space Model , 2003 .

[10]  George Karypis,et al.  A Comparison of Document Clustering Techniques , 2000 .

[11]  Abraham Kandel,et al.  Design and implementation of a web mining system for organizing search engine results , 2005, Int. J. Intell. Syst.

[12]  Oren Etzioni,et al.  Web document clustering: a feasibility demonstration , 1998, SIGIR '98.

[13]  Marti A. Hearst,et al.  Scatter/gather browsing communicates the topic structure of a very large text collection , 1996, CHI.

[14]  Marti A. Hearst,et al.  Reexamining the cluster hypothesis: scatter/gather on retrieval results , 1996, SIGIR '96.

[15]  Oren Etzioni,et al.  Grouper: A Dynamic Clustering Interface to Web Search Results , 1999, Comput. Networks.

[16]  Petra Perner,et al.  Data Mining - Concepts and Techniques , 2002, Künstliche Intell.

[17]  Craig Silverstein,et al.  Analysis of a Very Large Altavista Query Log, SRC Technical Note #1998-14 , 1998 .

[18]  Anil K. Jain,et al.  Data clustering: a review , 1999, CSUR.

[19]  Chinatsu Aone,et al.  Fast and effective text mining using linear-time document clustering , 1999, KDD '99.

[20]  Martin F. Porter,et al.  An algorithm for suffix stripping , 1997, Program.

[21]  Tu Bao Ho,et al.  Hierarchical Document Clustering Based on Tolerance Rough Set Model , 2000, PKDD.

[22]  Timo Honkela,et al.  WEBSOM - Self-organizing maps of document collections , 1998, Neurocomputing.

[23]  Amanda Spink,et al.  Searching the Web: the public and their queries , 2001 .

[24]  Dawid Weiss,et al.  Carrot and Language Properties in Web Search Results Clustering , 2003, AWIC.

[25]  Frank Smadja,et al.  From N-Grams to Collocations: An Evaluation of Xtract , 1991, ACL.

[26]  Gerard Salton,et al.  Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer , 1989 .

[27]  W. Bruce Croft,et al.  Improving the effectiveness of information retrieval with local context analysis , 2000, TOIS.

[28]  Jan Komorowski,et al.  Principles of Data Mining and Knowledge Discovery , 2001, Lecture Notes in Computer Science.

[29]  Andrzej Skowron,et al.  Rough Sets: A Tutorial , 1998 .

[30]  Tu Bao Ho,et al.  Nonhierarchical document clustering based on a tolerance rough set model , 2002, Int. J. Intell. Syst.

[31]  Jerzy W. Grzymala-Busse,et al.  Rough Sets , 1995, Commun. ACM.

[32]  Stanislaw Osinski,et al.  An Algorithm for Clustering of Web Search Results , 2003 .

[33]  M. F.,et al.  Bibliography , 1985, Experimental Gerontology.

[34]  Z. Pawlak.  Rough Sets: Theoretical Aspects of Reasoning about Data , 1991 .

[35]  Z. Pawlak,et al.  Rough membership functions , 1994 .