A new algorithm for clustering search results

We develop a new algorithm for clustering search results. Unlike many clustering systems recently proposed as a post-processing step for Web search engines, our system is not based on phrase analysis inside snippets; instead, it applies latent semantic indexing to the whole document content. A main contribution of the paper is a novel strategy, called dynamic SVD clustering, for discovering the optimal number of singular values to use for clustering. Moreover, the SVD computation step performs well in practice, which makes clustering feasible when term vectors are available. We show that the algorithm has very good classification performance and that it can be used effectively to cluster search engine results, making them easier for users to browse. The algorithm has been integrated into the Noodles search engine, a tool for searching and clustering Web and desktop documents.
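To make the general idea concrete, the following is a minimal sketch of clustering documents in a latent semantic space obtained from a truncated SVD. It is not the paper's dynamic SVD clustering: the number of singular values `k` is fixed by hand here, whereas the paper selects it dynamically by a strategy the abstract does not detail. The toy term-document matrix, the helper names `lsi_project` and `threshold_clusters`, and the cosine threshold `min_cos` are all assumptions made for illustration.

```python
import numpy as np


def lsi_project(term_doc, k):
    """Project documents (columns of term_doc) onto the top-k singular directions."""
    _, s, vt = np.linalg.svd(term_doc, full_matrices=False)
    # Row j of the result holds the k latent coordinates of document j.
    return (s[:k, None] * vt[:k]).T


def threshold_clusters(points, min_cos=0.8):
    """Label connected components of the graph linking points with cosine >= min_cos."""
    norms = np.linalg.norm(points, axis=1, keepdims=True)
    unit = points / np.where(norms == 0, 1.0, norms)  # avoid division by zero
    sim = unit @ unit.T
    n = len(points)
    labels = [-1] * n
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        labels[start] = current
        stack = [start]
        while stack:  # depth-first walk over sufficiently similar documents
            i = stack.pop()
            for j in range(n):
                if labels[j] == -1 and sim[i, j] >= min_cos:
                    labels[j] = current
                    stack.append(j)
        current += 1
    return labels


# Hypothetical toy term-document matrix: rows are terms, columns are documents.
term_doc = np.array([
    [3, 2, 0, 0],
    [2, 3, 0, 0],
    [0, 0, 2, 1],
    [0, 0, 1, 2],
], dtype=float)

points = lsi_project(term_doc, k=2)
labels = threshold_clusters(points)
print(labels)
```

In this toy case the first two documents share one label and the last two share another. The simple threshold grouping stands in for whatever clustering step the full system uses, and choosing `k` well is exactly the problem the paper's strategy addresses.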
