Clustering from Multiple Uncertain Experts

Utilizing expert input often improves clustering performance. However in a knowledge discovery problem, ground truth is unknown even to an expert. Thus, instead of one expert, we solicit the opinion from multiple experts. The key question motivating this work is: which experts should be assigned higher weights when there is disagreement on whether to put a pair of samples in the same group? To model the uncertainty in constraints from different experts, we build a probabilistic model for pairwise constraints through jointly modeling each expert’s accuracy and the mapping from features to latent cluster assignments. After learning our probabilistic discriminative clustering model and accuracies of different experts, 1) samples that were not annotated by any expert can be clustered using the discriminative clustering model; and 2) experts with higher accuracies are automatically assigned higher weights in determining the latent cluster assignments. Experimental results on UCI benchmark datasets and a real-world disease subtyping dataset demonstrate that our proposed approach outperforms competing alternatives, including semi-crowdsourced clustering, semi-supervised clustering with constraints from majority voting, and consensus clustering.

[1]  P. Cochat,et al.  Et al , 2008, Archives de pediatrie : organe officiel de la Societe francaise de pediatrie.

[2]  Ian Davidson,et al.  Constrained Clustering: Advances in Algorithms, Theory, and Applications , 2008 .

[3]  Inderjit S. Dhillon,et al.  Information-theoretic metric learning , 2006, ICML '07.

[4]  Christoph Lange,et al.  Risk loci for chronic obstructive pulmonary disease: a genome-wide association study and meta-analysis. , 2014, The Lancet. Respiratory medicine.

[5]  Raymond J. Mooney,et al.  Integrating constraints and metric learning in semi-supervised clustering , 2004, ICML.

[6]  Stephanie A. Santorico,et al.  Cluster analysis in the COPDGene study identifies subtypes of smokers with distinct patterns of airway disease and emphysema , 2014, Thorax.

[7]  Matthias Hein,et al.  Constrained 1-Spectral Clustering , 2012, AISTATS.

[8]  Claire Cardie,et al.  Proceedings of the Eighteenth International Conference on Machine Learning, 2001, p. 577–584. Constrained K-means Clustering with Background Knowledge , 2022 .

[9]  Mark W. Schmidt,et al.  Optimizing Costly Functions with Simple Constraints: A Limited-Memory Projected Quasi-Newton Algorithm , 2009, AISTATS.

[10]  Jinfeng Yi,et al.  Semi-Crowdsourced Clustering: Generalizing Crowd Labeling by Robust Distance Metric Learning , 2012, NIPS.

[11]  W. Kruskal,et al.  Use of Ranks in One-Criterion Variance Analysis , 1952 .

[12]  Anil K. Jain,et al.  Data clustering: a review , 1999, CSUR.

[13]  J. MacQueen Some methods for classification and analysis of multivariate observations , 1967 .

[14]  Joydeep Ghosh,et al.  Cluster Ensembles --- A Knowledge Reuse Framework for Combining Multiple Partitions , 2002, J. Mach. Learn. Res..

[15]  Jorge Nocedal,et al.  On the limited memory BFGS method for large scale optimization , 1989, Math. Program..

[16]  Jiaquan Xu,et al.  Deaths: final data for 2010. , 2013, National vital statistics reports : from the Centers for Disease Control and Prevention, National Center for Health Statistics, National Vital Statistics System.

[17]  Andreas Krause,et al.  Discriminative Clustering by Regularized Information Maximization , 2010, NIPS.

[18]  Pietro Perona,et al.  Crowdclustering , 2011, NIPS.

[19]  Chris H. Q. Ding,et al.  Solving Consensus and Semi-supervised Clustering Problems Using Nonnegative Matrix Factorization , 2007, Seventh IEEE International Conference on Data Mining (ICDM 2007).

[20]  Jinfeng Yi,et al.  Crowdclustering with Sparse Pairwise Labels: A Matrix Completion Approach , 2012, HCOMP@AAAI.

[21]  Arindam Banerjee,et al.  Semi-supervised Clustering by Seeding , 2002, ICML.