A New Approach to Search Result Clustering and Labeling

Search engines present query results as a long ordered list of web snippets divided into several pages. Post-processing retrieval results so that users can reach the desired information more easily is an important research problem. In this paper, we present a novel search result clustering approach that splits the long list of documents returned by search engines into meaningfully grouped and labeled clusters. Our method emphasizes clustering quality by using cover coefficient-based and sequential k-means clustering algorithms. A cluster labeling method based on term weighting is also introduced to reflect cluster contents. In addition, we present a new metric that employs precision and recall to assess the success of cluster labeling. We adopt a comparative strategy to measure the performance of the proposed method relative to two prominent search result clustering methods: Suffix Tree Clustering and Lingo. Experimental results on the publicly available AMBIENT and ODP-239 datasets show that our method can successfully perform both the clustering and the labeling tasks.
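As a rough illustration of two ingredients named above, the following sketch shows a MacQueen-style sequential k-means pass over snippet term vectors, followed by a label choice based on the highest-weighted centroid term. It is not the paper's implementation: the seed centroids are supplied by hand here (the abstract does not say how the cover coefficient step produces them), raw term frequency stands in for whatever term weighting the authors use, and all function names are hypothetical.

```python
import math
from collections import Counter


def vectorize(snippet):
    """Turn a snippet into a sparse term-frequency vector (assumed weighting)."""
    return dict(Counter(snippet.lower().split()))


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def sequential_kmeans(vectors, seeds):
    """Single-pass k-means: assign each vector to its most similar centroid,
    then move that centroid toward the vector by a running-mean step."""
    centroids = [dict(s) for s in seeds]
    counts = [1] * len(centroids)  # each seed counts as one prior observation
    assignments = []
    for v in vectors:
        best = max(range(len(centroids)), key=lambda i: cosine(v, centroids[i]))
        counts[best] += 1
        rate = 1.0 / counts[best]
        c = centroids[best]
        for term in set(c) | set(v):
            old = c.get(term, 0.0)
            c[term] = old + rate * (v.get(term, 0.0) - old)
        assignments.append(best)
    return assignments, centroids


def label(centroid):
    """Pick the highest-weighted centroid term as the cluster label."""
    return max(centroid, key=centroid.get)


# Hypothetical snippets for an ambiguous query, with hand-picked seeds.
snippets = [
    "jaguar car dealer",
    "jaguar car price",
    "jaguar cat habitat",
    "jaguar cat jungle",
]
seeds = [{"car": 1.0}, {"cat": 1.0}]
assignments, centroids = sequential_kmeans([vectorize(s) for s in snippets], seeds)
labels = [label(c) for c in centroids]
```

Because the centroids are updated after every snippet, the result depends on the order in which snippets arrive and on the seeds, which is why a separate step for choosing the number of clusters and their seeds matters in practice.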
