A scalable algorithm for high-quality clustering of web snippets

We consider the problem of partitioning, both accurately and efficiently, a set of n documents lying in a metric space into k non-overlapping clusters. We augment the well-known furthest-point-first algorithm for k-center clustering in metric spaces with a filtering scheme based on the triangle inequality. We apply this algorithm to Web snippet clustering, comparing it against strong baselines consisting of recent, fast variants of the classical iterative k-means algorithm. Our main conclusion is that our method attains solutions of better or comparable accuracy, and does so in a fraction of the time required by the baselines. Our algorithm is thus valuable when, as in Web snippet clustering, either the real-time nature of the task or the large amount of data makes the poorly scalable traditional clustering methods unsuitable.
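The idea of combining furthest-point-first with a triangle-inequality filter can be illustrated with a minimal sketch. This is not the paper's implementation. The function name `fpf_cluster`, the use of Euclidean distance via `math.dist`, the choice of the first point as the initial center, and the specific filter rule are all assumptions made for illustration. The filter used here is the standard one: if a point p is at distance d from its current center c, and the new center is at distance at least 2d from c, then by the triangle inequality the new center cannot be closer to p than c is, so that distance need not be computed.

```python
import math


def fpf_cluster(points, k, dist=math.dist):
    """Furthest-point-first k-center clustering with a triangle-inequality filter.

    Returns (centers, assign, computed) where centers holds indices into
    points, assign[i] is the position in centers of the center nearest to
    point i, and computed counts the distance evaluations performed.
    """
    n = len(points)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and len(points)")

    centers = [0]                      # assumption: start from the first point
    assign = [0] * n                   # every point starts in cluster 0
    d = [dist(p, points[0]) for p in points]  # distance to assigned center
    computed = n

    while len(centers) < k:
        # The next center is the point furthest from its current center.
        new = max(range(n), key=d.__getitem__)
        new_pos = len(centers)

        # Distances from the new center to every existing center.
        cc = [dist(points[new], points[c]) for c in centers]
        computed += len(centers)
        centers.append(new)

        for i in range(n):
            # Filter: d(p, new) >= d(c, new) - d(p, c) >= d(p, c)
            # whenever d(c, new) >= 2 * d(p, c), so p keeps its center.
            if cc[assign[i]] >= 2 * d[i]:
                continue
            di = dist(points[i], points[new])
            computed += 1
            if di < d[i]:
                d[i] = di
                assign[i] = new_pos

    return centers, assign, computed
```

Calling `fpf_cluster(points, k)` on a list of coordinate tuples returns the chosen center indices, the cluster assignment of each point, and the number of distance evaluations, which can be compared with the n·k evaluations of the unfiltered algorithm. Any metric can be supplied through the hypothetical `dist` parameter; the filter is only valid when that function satisfies the triangle inequality.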
