Correlation Clustering Revisited: The "True" Cost of Error Minimization Problems

Correlation Clustering was defined by Bansal, Blum, and Chawla as the problem of clustering a set of elements based on a possibly inconsistent binary similarity function between element pairs. Their setting is agnostic in the sense that a ground-truth clustering is not assumed to exist, and the cost of a solution is computed against the input similarity function. This problem has been studied in theory and in practice and was subsequently proven to be APX-hard. In this work we assume that there does exist an unknown correct clustering of the data. In this setting, we argue that it is more reasonable to measure the accuracy of the output clustering against the unknown underlying true clustering. We present two main results. The first is a novel method for continuously morphing a general (non-metric) function into a pseudometric. This technique may be useful for other metric embedding and clustering problems. The second is a simple algorithm for randomly rounding a pseudometric into a clustering. Combining the two, we obtain a certificate that an approximation factor strictly less than 2 is achievable for our problem. This approximation factor could not have been achieved by considering the agnostic version of the problem unless P = NP.
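The difference between the two cost measures can be illustrated with a small Python sketch. It is an illustration under stated assumptions, not the paper's algorithm: the names `disagreements`, `same_cluster`, `truth`, `noisy`, and `output` are hypothetical, and the "true" cost is taken here to be the number of element pairs on which the output and the ground-truth clustering differ.

```python
from itertools import combinations

def disagreements(clustering, similar):
    """Count element pairs on which `clustering` disagrees with the
    binary similarity function `similar(u, v) -> bool`."""
    cost = 0
    for u, v in combinations(sorted(clustering), 2):
        together = clustering[u] == clustering[v]
        if together != similar(u, v):
            cost += 1
    return cost

def same_cluster(clustering):
    """Turn a clustering (element -> cluster id) into a similarity function."""
    return lambda u, v: clustering[u] == clustering[v]

# Hypothetical ground truth: two clusters {a, b} and {c, d}.
truth = {"a": 0, "b": 0, "c": 1, "d": 1}

# Hypothetical noisy input: the truth plus one wrong "similar" pair (b, c).
noisy_pairs = {frozenset(p) for p in [("a", "b"), ("c", "d"), ("b", "c")]}

def noisy(u, v):
    return frozenset((u, v)) in noisy_pairs

# A candidate output that merges everything into one cluster.
output = {"a": 0, "b": 0, "c": 0, "d": 0}

agnostic_cost = disagreements(output, noisy)            # cost against the input
true_cost = disagreements(output, same_cluster(truth))  # cost against the truth
print(agnostic_cost, true_cost)
```

In this toy instance the same output is charged differently by the two measures, because the noisy pair (b, c) counts in favor of the merged clustering under the agnostic cost but against it under the ground-truth cost.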
