A divide-and-merge methodology for clustering

We present a divide-and-merge methodology for clustering a set of objects that combines a top-down "divide" phase with a bottom-up "merge" phase. In contrast, previous algorithms either use top-down or bottom-up methods to construct a hierarchical clustering or produce a flat clustering using local search (e.g., k-means). Our divide phase produces a tree whose leaves are the elements of the set. For this phase, we use an efficient spectral algorithm. The merge phase quickly finds an optimal tree-respecting partition for many natural objective functions, e.g., k-means, min-diameter, min-sum, correlation clustering, etc., We present a meta-search engine that uses this methodology to cluster results from web searches. We also give empirical results on text-based data where the algorithm performs better than or competitively with existing clustering algorithms.

[1]  David R. Karger,et al.  Scatter/Gather: a cluster-based approach to browsing large document collections , 1992, SIGIR '92.

[2]  Sudipto Guha,et al.  ROCK: a robust clustering algorithm for categorical attributes , 1999, Proceedings 15th International Conference on Data Engineering (Cat. No.99CB36337).

[3]  Yi Li,et al.  COOLCAT: an entropy-based algorithm for categorical clustering , 2002, CIKM '02.

[4]  Gene H. Golub,et al.  Matrix computations , 1983 .

[5]  Rajeev Motwani,et al.  Incremental Clustering and Dynamic Information Retrieval , 2004, SIAM J. Comput..

[6]  James Theiler,et al.  Contiguity-enhanced k-means clustering algorithm for unsupervised multispectral image segmentation , 1997, Optics & Photonics.

[7]  Amit Kumar,et al.  A simple linear time ( 1+ ε)- approximation algorithm for geometric k-means clustering in any dimensions , 2004 .

[8]  Venkatesan Guruswami,et al.  Clustering with qualitative information , 2005, 44th Annual IEEE Symposium on Foundations of Computer Science, 2003. Proceedings..

[9]  Oren Etzioni,et al.  Fast and Intuitive Clustering of Web Documents , 1997, KDD.

[10]  Evangelos E. Milios,et al.  Using Unsupervised Learning to Guide Resampling in Imbalanced Data Sets , 2001, AISTATS.

[11]  Teofilo F. Gonzalez,et al.  P-Complete Approximation Problems , 1976, J. ACM.

[12]  Chaitanya Swamy,et al.  Correlation Clustering: maximizing agreements via semidefinite programming , 2004, SODA '04.

[13]  J. A. Hartigan,et al.  A k-means clustering algorithm , 1979 .

[14]  Anil K. Jain,et al.  Algorithms for Clustering Data , 1988 .

[15]  Gene H. Golub,et al.  Matrix computations (3rd ed.) , 1996 .

[16]  J. Mesirov,et al.  Interpreting patterns of gene expression with self-organizing maps: methods and application to hematopoietic differentiation. , 1999, Proceedings of the National Academy of Sciences of the United States of America.

[17]  Sang Joon Kim,et al.  A Mathematical Theory of Communication , 2006 .

[18]  Martin Ester,et al.  Frequent term-based text clustering , 2002, KDD.

[19]  Inderjit S. Dhillon,et al.  Co-clustering documents and words using bipartite spectral graph partitioning , 2001, KDD '01.

[20]  J. Mesirov,et al.  Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. , 1999, Science.

[21]  George Karypis,et al.  A Comparison of Document Clustering Techniques , 2000 .

[22]  Ada Wai-Chee Fu,et al.  Incremental Document Clustering for Web Page Classification , 2002 .

[23]  Daniel Boley,et al.  Principal Direction Divisive Partitioning , 1998, Data Mining and Knowledge Discovery.

[24]  Chinatsu Aone,et al.  Fast and effective text mining using linear-time document clustering , 1999, KDD '99.

[25]  Vijay V. Vazirani,et al.  Finding k-cuts within twice the optimal , 1991, [1991] Proceedings 32nd Annual Symposium of Foundations of Computer Science.

[26]  Thomas Hofmann,et al.  The Cluster-Abstraction Model: Unsupervised Learning of Topic Hierarchies from Text Data , 1999, IJCAI.

[27]  Michael R. Anderberg,et al.  Cluster Analysis for Applications , 1973 .

[28]  Santosh S. Vempala,et al.  On clusterings-good, bad and spectral , 2000, Proceedings 41st Annual Symposium on Foundations of Computer Science.

[29]  Amos Fiat,et al.  Correlation Clustering - Minimizing Disagreements on Arbitrary Weighted Graphs , 2003, ESA.

[30]  Naftali Tishby,et al.  Document clustering using word clusters via the information bottleneck method , 2000, SIGIR '00.

[31]  Chris H. Q. Ding,et al.  Spectral Relaxation for K-means Clustering , 2001, NIPS.

[32]  Marek Karpinski,et al.  Approximation schemes for clustering problems , 2003, STOC '03.

[33]  Anil K. Jain,et al.  Data clustering: a review , 1999, CSUR.

[34]  Nicole Immorlica,et al.  Approximation, Randomization, and Combinatorial Optimization.. Algorithms and Techniques , 2003, Lecture Notes in Computer Science.

[35]  Renée J. Miller,et al.  LIMBO: Scalable Clustering of Categorical Data , 2004, EDBT.

[36]  Amit Kumar,et al.  A simple linear time (1 + /spl epsiv/)-approximation algorithm for k-means clustering in any dimensions , 2004, 45th Annual IEEE Symposium on Foundations of Computer Science.

[37]  Mark Jerrum,et al.  Approximate Counting, Uniform Generation and Rapidly Mixing Markov Chains , 1987, International Workshop on Graph-Theoretic Concepts in Computer Science.

[38]  Mark Jerrum,et al.  Approximate Counting, Uniform Generation and Rapidly Mixing Markov Chains , 1987, WG.

[39]  George Karypis,et al.  Evaluation of hierarchical clustering algorithms for document datasets , 2002, CIKM '02.

[40]  Avrim Blum,et al.  Correlation Clustering , 2004, Machine Learning.