Clustering with a faulty oracle

Clustering, i.e., finding groups in the data, is a problem that permeates multiple fields of science and engineering. Recently, the problem of clustering with a noisy oracle has drawn attention due to various applications, including crowdsourced entity resolution [33] and predicting signs of interactions in large-scale online social networks [20, 21]. Here, we consider the following fundamental model for two clusters, as proposed by Mitzenmacher and Tsourakakis [28] and by Mazumdar and Saha [25]: there exist n items belonging to two unknown groups. For any pair of items, we are allowed to query whether they belong to the same cluster or not, but the answer to the query is corrupted with some probability 0 < q < 1/2. Let 1 > δ = 1 − 2q > 0 be the bias. In this work, we provide a polynomial time algorithm that recovers all signs correctly with high probability in the presence of noise with queries. This is the best known result for this problem for all but tiny δ, improving on the current state-of-the-art due to Mazumdar and Saha [25].
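The query model above can be illustrated with a small simulation. The following is a minimal sketch, not the algorithm of this work: the names `make_faulty_oracle` and `empirical_bias` are hypothetical, and the sketch assumes that the noise is persistent (repeating a query returns the same answer), which the abstract does not state. It encodes "same cluster" as +1 and "different cluster" as −1, so that the average of (true sign × answered sign) over many pairs should be close to the bias δ = 1 − 2q.

```python
import random


def make_faulty_oracle(labels, q, seed=0):
    """Return query(u, v) -> +1 (same cluster) or -1 (different cluster).

    Each answer is flipped independently with probability q.
    Assumption of this sketch: answers are cached, so asking the
    same pair twice yields the same (possibly wrong) answer.
    """
    rng = random.Random(seed)
    cache = {}

    def query(u, v):
        key = (min(u, v), max(u, v))  # the pair is unordered
        if key not in cache:
            truth = 1 if labels[u] == labels[v] else -1
            cache[key] = -truth if rng.random() < q else truth
        return cache[key]

    return query


def empirical_bias(labels, query):
    """Average of truth * answer over all pairs; its expectation is 1 - 2q."""
    n = len(labels)
    total = 0
    agreement = 0
    for u in range(n):
        for v in range(u + 1, n):
            truth = 1 if labels[u] == labels[v] else -1
            agreement += truth * query(u, v)
            total += 1
    return agreement / total


if __name__ == "__main__":
    rng = random.Random(1)
    labels = [rng.randint(0, 1) for _ in range(200)]  # two hidden groups
    q = 0.2                                           # bias is 1 - 2q = 0.6
    oracle = make_faulty_oracle(labels, q, seed=42)
    print("estimated bias:", empirical_bias(labels, oracle))
```

A single noisy answer is only slightly better than a coin flip when δ is small, which is why the number of queries needed for exact recovery grows as δ shrinks.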

[1]  Eli Upfal,et al.  Probability and Computing: Randomized Algorithms and Probabilistic Analysis , 2005 .

[2]  Aravindan Vijayaraghavan,et al.  Correlation Clustering with Noisy Partial Information , 2014, COLT.

[3]  N. Alon,et al.  λ1, Isoperimetric Inequalities for Graphs, and Superconcentrators , 1985 .

[4]  Frank McSherry,et al.  Spectral partitioning of random graphs , 2001, Proceedings 42nd IEEE Symposium on Foundations of Computer Science.

[5]  Charalampos E. Tsourakakis,et al.  Optimal Learning of Joint Alignments with a Faulty Oracle , 2019, 2020 Information Theory and Applications Workshop (ITA).

[6]  Venkatesan Guruswami,et al.  Query strategies for priced information (extended abstract) , 2000, STOC '00.

[7]  Arya Mazumdar,et al.  Clustering with Noisy Queries , 2017, NIPS.

[8]  Edo Liberty,et al.  Correlation clustering: from theory to practice , 2014, KDD.

[9]  Charalampos E. Tsourakakis,et al.  Joint Alignment from Pairwise Differences with a Noisy Oracle , 2018, WAW.

[10]  Van Vu,et al.  A Simple SVD Algorithm for Finding Hidden Partitions , 2014, Combinatorics, Probability and Computing.

[11]  Anil K. Jain Data clustering: 50 years beyond K-means , 2008, Pattern Recognit. Lett..

[12]  Hector Garcia-Molina,et al.  Entity Resolution with crowd errors , 2015, 2015 IEEE 31st International Conference on Data Engineering.

[13]  Emmanuel J. Candès,et al.  Robust uncertainty principles: exact signal reconstruction from highly incomplete frequency information , 2004, IEEE Transactions on Information Theory.

[14]  Mark Braverman,et al.  Noisy sorting without resampling , 2007, SODA '08.

[15]  Yuxin Chen,et al.  The Projected Power Method: An Efficient Algorithm for Joint Alignment from Pairwise Differences , 2016, Communications on Pure and Applied Mathematics.

[16]  Ulrik Brandes,et al.  Experiments on Graph Clustering Algorithms , 2003, ESA.

[17]  Claudio Gentile,et al.  A Correlation Clustering Approach to Link Classification in Signed Networks , 2012, COLT.

[18]  Geoffrey J. McLachlan,et al.  Mixture models : inference and applications to clustering , 1989 .

[19]  Sujay Sanghavi,et al.  Clustering Sparse Graphs , 2012, NIPS.

[20]  Sergei Vassilvitskii,et al.  Scalable K-Means++ , 2012, Proc. VLDB Endow..

[21]  Bruce E. Hajek,et al.  Achieving exact cluster recovery threshold via semidefinite programming , 2015, 2015 IEEE International Symposium on Information Theory (ISIT).

[22]  Roded Sharan,et al.  Cluster graph modification problems , 2002, Discret. Appl. Math..

[23]  Michael I. Jordan,et al.  On Spectral Clustering: Analysis and an algorithm , 2001, NIPS.

[24]  Jure Leskovec,et al.  Signed networks in social media , 2010, CHI.

[25]  Arya Mazumdar,et al.  Clustering Via Crowdsourcing , 2016, ArXiv.

[26]  Sergei Vassilvitskii,et al.  k-means++: the advantages of careful seeding , 2007, SODA '07.

[27]  Nagarajan Natarajan,et al.  Prediction and clustering in signed networks: a local to global perspective , 2013, J. Mach. Learn. Res..

[28]  Emmanuel Abbe,et al.  Exact Recovery in the Stochastic Block Model , 2014, IEEE Transactions on Information Theory.

[29]  Jure Leskovec,et al.  Predicting positive and negative links in online social networks , 2010, WWW '10.

[30]  Amit Kumar,et al.  Sorting and selection with structured costs , 2001, Proceedings 42nd IEEE Symposium on Foundations of Computer Science.

[31]  Yudong Chen,et al.  Clustering Partially Observed Graphs via Convex Optimization , 2011, ICML.

[32]  Anthony Wirth,et al.  Correlation Clustering , 2010, Encyclopedia of Machine Learning and Data Mining.

[33]  Claire Mathieu,et al.  Correlation clustering with noisy input , 2010, SODA '10.