On the Approximation of Correlation Clustering and Consensus Clustering

The Correlation Clustering problem was recently introduced [N. Bansal, A. Blum, S. Chawla, Correlation Clustering, in: Proc. 43rd Symp. Foundations of Computer Science, FOCS, 2002, pp. 238-247] as a model for clustering data when a binary relationship between data points is known. More precisely, for each pair of points we have two scores measuring the similarity and the dissimilarity, respectively, of the two points, and we would like to compute an optimal partition, where the value of a partition is obtained by summing the similarity scores of pairs of points in the same cluster and the dissimilarity scores of pairs of points in different clusters. A closely related problem is Consensus Clustering, where we are given a set of partitions and would like to obtain a partition that best summarizes the input partitions. The latter problem is a restricted case of Correlation Clustering. In this paper we prove that Minimum Consensus Clustering is APX-hard even for three input partitions, answering an open question in the literature, while Maximum Consensus Clustering admits a PTAS. We exhibit a combinatorial and practical 4/5-approximation algorithm based on a greedy technique for Maximum Consensus Clustering on three partitions. Moreover, we prove that a PTAS exists for Maximum Correlation Clustering when the maximum ratio between two scores is at most a constant.
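
The value of a partition described above can be illustrated with a minimal Python sketch. It is not the paper's algorithm; it only evaluates the objective. The function names `partition_value` and `consensus_scores` are hypothetical, and the way scores are derived from input partitions (counting how many inputs place a pair together or apart) is an assumed, commonly used reading of how Consensus Clustering becomes a restricted case of Correlation Clustering.

```python
from itertools import combinations


def partition_value(labels, sim, dis):
    """Value of a partition under the maximization objective.

    labels[i] is the cluster label of point i. A pair in the same cluster
    contributes its similarity score; a pair in different clusters
    contributes its dissimilarity score.
    """
    total = 0
    for i, j in combinations(range(len(labels)), 2):
        if labels[i] == labels[j]:
            total += sim[i][j]
        else:
            total += dis[i][j]
    return total


def consensus_scores(partitions):
    """Derive pairwise scores from a list of input partitions (assumption).

    sim[i][j] counts the input partitions that put i and j together;
    dis[i][j] counts those that separate them.
    """
    n = len(partitions[0])
    k = len(partitions)
    sim = [[0] * n for _ in range(n)]
    dis = [[0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        together = sum(1 for p in partitions if p[i] == p[j])
        sim[i][j] = sim[j][i] = together
        dis[i][j] = dis[j][i] = k - together
    return sim, dis


# Three input partitions of four points, each given as a label vector.
inputs = [[0, 0, 1, 1], [0, 0, 0, 1], [0, 1, 1, 1]]
sim, dis = consensus_scores(inputs)
value = partition_value([0, 0, 1, 1], sim, dis)  # total pairwise agreements
```

With scores built this way, the value of a candidate partition equals the total number of pairwise agreements between the candidate and the input partitions, so maximizing it corresponds to the maximization variant of Consensus Clustering under the stated assumption.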
