Are approximation algorithms for consensus clustering worthwhile?

Consensus clustering has emerged as one of the principal clustering problems in the data mining community. In recent years the theoretical computer science community has produced a number of approximation algorithms for consensus clustering and related problems. These algorithms run in polynomial time, with performance guaranteed to be at most a constant factor worse than optimal. We investigate the practicality of these approximation algorithms, in an attempt to bridge data-mining and theoretical research. On realistic data sets, algorithms with quadratic running times are impractical; unfortunately, such running times, and worse, are typical of approximation algorithms. To circumvent this, we sample from the data, run the "slow" algorithms on the sample, and then extend the seed sample clustering to a consensus clustering of the full data set, using a range of techniques. These unsampling techniques are in fact almost as good at creating consensus partitionings as the approximation and data-mining algorithms themselves. We find that one of the latest approximation algorithms is not only fast and effective but also easy to describe, making it an ideal choice.
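The sample-then-unsample pipeline described above is simple enough to sketch. Below is a minimal Python illustration, assuming the "easy to describe" algorithm alluded to is the randomized majority-vote pivot scheme in the style of Ailon, Charikar, and Newman; the unsampling rule shown (assigning each unsampled element to the seed cluster it most often agrees with across the input partitions) is one plausible variant of such techniques, not necessarily the paper's, and all function names are hypothetical.

```python
import random

def pivot_consensus(elements, clusterings):
    """Randomized pivot consensus clustering (Ailon et al. style sketch).

    elements    -- list of item identifiers
    clusterings -- list of dicts mapping each element to a cluster label

    Each round picks a random pivot and groups with it every element
    that a majority of the input clusterings co-clusters with the pivot.
    Returns a list of clusters (sets of elements).
    """
    remaining = list(elements)
    clusters = []
    while remaining:
        pivot = random.choice(remaining)
        cluster = {pivot}
        rest = []
        for x in remaining:
            if x == pivot:
                continue
            votes = sum(1 for c in clusterings if c[x] == c[pivot])
            if votes * 2 >= len(clusterings):  # majority co-clusters x with pivot
                cluster.add(x)
            else:
                rest.append(x)
        clusters.append(cluster)
        remaining = rest
    return clusters

def unsample(sample_clusters, all_elements, clusterings):
    """Hypothetical unsampling step: extend a clustering of a sample to the
    full data set by assigning each unsampled element to the seed cluster
    it agrees with most often across the input clusterings."""
    sampled = set().union(*sample_clusters)
    full = [set(c) for c in sample_clusters]
    for x in all_elements:
        if x in sampled:
            continue
        best = max(
            full,
            key=lambda cl: sum(1 for c in clusterings for y in cl if c[x] == c[y]),
        )
        best.add(x)
    return full

# Usage: build three noisy input partitions, cluster a small sample with the
# "slow" algorithm, then unsample to obtain a full consensus partitioning.
elements = list(range(1000))
clusterings = [{x: (x + shift) // 100 for x in elements} for shift in (0, 3, 7)]
sample = random.sample(elements, 100)
seed = pivot_consensus(sample, clusterings)
consensus = unsample(seed, elements, clusterings)
```

The pivot step is the part that would be too slow at full scale (it compares the pivot against every remaining element in every input partition), which is why it is run only on the sample here.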
