Paper information - A divide-and-merge methodology for clustering

A divide-and-merge methodology for clustering

We present a divide-and-merge methodology for clustering a set of objects that combines a top-down "divide" phase with a bottom-up "merge" phase. In contrast, previous algorithms use either top-down or bottom-up methods to construct a hierarchical clustering, or produce a flat clustering using local search (e.g., k-means). Our divide phase produces a tree whose leaves are the elements of the set. For this phase, we use an efficient spectral algorithm. The merge phase quickly finds an optimal tree-respecting partition for many natural objective functions, e.g., k-means, min-diameter, min-sum, and correlation clustering. We present a meta-search engine that uses this methodology to cluster results from web searches. We also give empirical results on text-based data where the algorithm performs better than or competitively with existing clustering algorithms.
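To make the merge phase concrete, here is a minimal sketch of one way to find the best tree-respecting partition for the k-means objective: dynamic programming over the tree, bottom-up. It makes several assumptions that the abstract does not state. The tree is binary and written as nested 2-tuples, leaves are integer indices into a list of points, and the names `sse` and `best_tree_partition` are hypothetical. The divide phase (the spectral algorithm) is not shown; the tree is taken as given.

```python
def sse(points, idx):
    """Sum of squared distances from the points in idx to their centroid."""
    dim = len(points[0])
    centroid = [sum(points[i][d] for i in idx) / len(idx) for d in range(dim)]
    return sum((points[i][d] - centroid[d]) ** 2 for i in idx for d in range(dim))


def best_tree_partition(tree, points, k):
    """Return (cost, clusters) for the cheapest partition into exactly k
    clusters in which every cluster is the leaf set of some tree node.

    Assumption: tree is an int (leaf index) or a 2-tuple (left, right).
    """
    def solve(node):
        # Returns (leaves under node, table), where table[j] holds the
        # best (cost, clusters) using exactly j clusters in this subtree.
        if isinstance(node, int):
            return [node], {1: (0.0, [[node]])}
        left, right = node
        l_leaves, l_tab = solve(left)
        r_leaves, r_tab = solve(right)
        all_leaves = l_leaves + r_leaves
        # One cluster: keep the whole subtree together.
        table = {1: (sse(points, all_leaves), [all_leaves])}
        # More clusters: split the budget between the two children.
        for i, (l_cost, l_clusters) in l_tab.items():
            for j, (r_cost, r_clusters) in r_tab.items():
                if i + j > k:
                    continue
                cost = l_cost + r_cost
                if i + j not in table or cost < table[i + j][0]:
                    table[i + j] = (cost, l_clusters + r_clusters)
        return all_leaves, table

    _, table = solve(tree)
    if k not in table:
        raise ValueError("k must be between 1 and the number of leaves")
    return table[k]


# Usage: four 1-D points and a tree that pairs the near neighbours.
points = [(0.0,), (1.0,), (10.0,), (11.0,)]
tree = ((0, 1), (2, 3))
cost, clusters = best_tree_partition(tree, points, 2)
print(cost, clusters)
```

Each node combines at most k entries from each child, so the table at a node costs O(k^2) combinations plus one evaluation of the objective on its leaf set. Other objectives named in the abstract could be tried by swapping `sse` for a different cluster cost, provided the total cost is a sum over clusters; that substitution is an assumption of this sketch, not a claim about the paper's implementation.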
