Relaxed Oracles for Semi-Supervised Clustering

Pairwise "same-cluster" queries are one of the most widely used forms of supervision in semi-supervised clustering. However, it is impractical to expect human oracles to answer every query correctly. In this paper, we study the effect of allowing "not-sure" answers from a weak oracle and propose an effective algorithm to handle such uncertainty in query responses. We consider two realistic weak oracle models in which ambiguity in answering depends on the distance between the two queried points. We show that, by providing better pairs to the weak oracle, a small query complexity suffices for effective clustering with high probability. Experimental results on synthetic and real data show the effectiveness of our approach in overcoming supervision uncertainty and yielding high-quality clusters.
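
To make the notion of a distance-dependent weak oracle concrete, the following is a minimal illustrative sketch and not the oracle models defined in the paper. It assumes a hypothetical rule in which the oracle abstains when two points of the same cluster are far apart, or when two points of different clusters are close together. The function name `weak_oracle` and the thresholds `near` and `far` are assumptions introduced only for this illustration.

```python
import math


def weak_oracle(x, y, same_cluster, near=1.0, far=3.0):
    """Hypothetical weak oracle for a pairwise same-cluster query.

    x, y          -- points given as coordinate sequences
    same_cluster  -- ground-truth answer known to the oracle
    near, far     -- assumed distance thresholds (illustrative only)

    Returns "same", "different", or "not-sure".
    """
    d = math.dist(x, y)  # Euclidean distance (Python 3.8+)

    # Assumed ambiguity rule: a same-cluster pair that is far apart,
    # or a different-cluster pair that is close, is hard to judge.
    if same_cluster and d > far:
        return "not-sure"
    if not same_cluster and d < near:
        return "not-sure"

    return "same" if same_cluster else "different"


if __name__ == "__main__":
    print(weak_oracle((0.0, 0.0), (0.5, 0.0), same_cluster=True))
    print(weak_oracle((0.0, 0.0), (5.0, 0.0), same_cluster=True))
```

Under this assumed rule, a query is more likely to receive a definite answer when the pair is chosen so that its distance is consistent with its true relationship, which is one way to read "providing better pairs to the weak oracle".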
