Online Correlation Clustering

We study the online clustering problem, in which data items arrive one at a time and the algorithm maintains a clustering of the items into similarity classes. When an item v arrives, its relation to every previously arrived item u is revealed: we are told whether v is similar to u. The algorithm can then create a new cluster for v and merge existing clusters. When the objective is to minimize disagreements between the clustering and the input, we prove that a natural greedy algorithm is O(n)-competitive, and that this is optimal. When the objective is to maximize agreements between the clustering and the input, we prove that the greedy algorithm is 0.5-competitive and that no online algorithm can be better than 0.834-competitive. We also prove that a ratio better than 1/2 is achievable, by exhibiting a randomized algorithm with competitive ratio 0.5+c for a small positive fixed constant c.
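The online model above can be illustrated with a short Python sketch. The abstract does not spell out the greedy rule, so the rule used here is an assumption: each arriving item starts in its own cluster, and two clusters are merged whenever the similar pairs between them strictly outnumber the dissimilar pairs. The class and method names (`GreedyOnlineClustering`, `arrive`, `disagreements`) are hypothetical and chosen only for illustration.

```python
from itertools import combinations


class GreedyOnlineClustering:
    """Illustrative greedy rule (an assumption, not taken from the paper)."""

    def __init__(self):
        self.clusters = []    # list of sets of items
        self.similar = set()  # frozenset({u, v}) for every similar pair seen so far

    def _gain(self, a, b):
        # Similar cross pairs minus dissimilar cross pairs between clusters a and b.
        plus = sum(1 for u in a for w in b if frozenset((u, w)) in self.similar)
        return plus - (len(a) * len(b) - plus)

    def arrive(self, v, similar_to):
        # Reveal v's relation to earlier items: those listed are similar,
        # every other earlier item is dissimilar.
        for u in similar_to:
            self.similar.add(frozenset((u, v)))
        self.clusters.append({v})  # v starts in a new singleton cluster
        while True:
            # Pick the pair of clusters whose merge helps the most.
            best = max(
                combinations(range(len(self.clusters)), 2),
                key=lambda ij: self._gain(self.clusters[ij[0]], self.clusters[ij[1]]),
                default=None,
            )
            if best is None:
                break
            i, j = best
            if self._gain(self.clusters[i], self.clusters[j]) <= 0:
                break  # no merge strictly improves the objective
            self.clusters[i] |= self.clusters[j]
            del self.clusters[j]  # safe because i < j

    def disagreements(self):
        # Dissimilar pairs placed together plus similar pairs placed apart.
        label = {x: k for k, c in enumerate(self.clusters) for x in c}
        return sum(
            (label[u] == label[w]) != (frozenset((u, w)) in self.similar)
            for u, w in combinations(label, 2)
        )
```

For example, suppose `a` arrives, then `b` (similar to `a`), then `c` (similar to `b` but not to `a`). Under the assumed rule, `a` and `b` are merged, and `c` stays alone because merging it would gain one agreement and lose one. The resulting clustering has a single disagreement, the similar pair (`b`, `c`) split across clusters.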
