Clustering with a Reject Option: Interactive Clustering as Bayesian Prior Elicitation

A good clustering can help a data analyst to explore and understand a data set, but what constitutes a good clustering may depend on domain-specific and application-specific criteria. These criteria can be difficult to formalize, even when it is easy for an analyst to know a good clustering when she sees one. We present a new approach to interactive clustering for data exploration, called \ciif, based on a particularly simple feedback mechanism, in which an analyst can choose to reject individual clusters and request new ones. The new clusters should be different from previously rejected clusters while still fitting the data well. We formalize this interaction in a novel Bayesian prior elicitation framework. In each iteration, the prior is adapted to account for all the previous feedback, and a new clustering is then produced from the posterior distribution. To achieve the computational efficiency necessary for an interactive setting, we propose an incremental optimization method over data minibatches using Lagrangian relaxation. Experiments demonstrate that \ciif can produce accurate and diverse clusterings.

[1]  Xiaojin Zhu,et al.  Incorporating domain knowledge into topic modeling via Dirichlet Forest priors , 2009, ICML '09.

[2]  Claire Cardie,et al.  Proceedings of the Eighteenth International Conference on Machine Learning, 2001, p. 577–584. Constrained K-means Clustering with Background Knowledge , 2022 .

[3]  James Bailey,et al.  Generation of Alternative Clusterings Using the CAMI Approach , 2010, SDM.

[4]  Jordan L. Boyd-Graber,et al.  Interactive topic modeling , 2014, ACL.

[5]  Thomas Hofmann,et al.  Non-redundant data clustering , 2004, Fourth IEEE International Conference on Data Mining (ICDM'04).

[6]  Arindam Banerjee,et al.  Active Semi-Supervision for Pairwise Constrained Clustering , 2004, SDM.

[7]  Vincent Ng,et al.  Which Clustering Do You Want? Inducing Your Ideal Clustering with Minimal Feedback , 2010, J. Artif. Intell. Res..

[8]  Santosh S. Vempala,et al.  A discriminative framework for clustering via similarity functions , 2008, STOC.

[9]  Inderjit S. Dhillon,et al.  Simultaneous Unsupervised Learning of Disparate Clusterings , 2008, Stat. Anal. Data Min..

[10]  Ryan P. Adams,et al.  Contrastive Learning Using Spectral Methods , 2013, NIPS.

[11]  Maria-Florina Balcan,et al.  Clustering with Interactive Feedback , 2008, ALT.

[12]  Quentin Pleple,et al.  Interactive Topic Modeling , 2013 .

[13]  Kristen Grauman,et al.  Diverse Sequential Subset Selection for Supervised Video Summarization , 2014, NIPS.

[14]  Ying Cui,et al.  Learning multiple nonredundant clusterings , 2010, TKDD.

[15]  Rishabh K. Iyer,et al.  Submodular Hamming Metrics , 2015, NIPS.

[16]  Rich Caruana,et al.  Meta Clustering , 2006, Sixth International Conference on Data Mining (ICDM'06).

[17]  Andrew Zisserman,et al.  Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.

[18]  Ben Taskar,et al.  Posterior Regularization for Structured Latent Variable Models , 2010, J. Mach. Learn. Res..

[19]  Thomas G. Dietterich What is machine learning? , 2020, Archives of Disease in Childhood.

[20]  David R. Karger,et al.  Scatter/Gather: a cluster-based approach to browsing large document collections , 1992, SIGIR '92.

[21]  Hal Daumé,et al.  Incorporating Lexical Priors into Topic Models , 2012, EACL.

[22]  M. Cugmas,et al.  On comparing partitions , 2015 .

[23]  James Bailey,et al.  COALA: A Novel Approach for the Extraction of an Alternate Clustering of High Quality and High Dissimilarity , 2006, Sixth International Conference on Data Mining (ICDM'06).

[24]  Jeremy E. Oakley,et al.  Uncertain Judgements: Eliciting Experts' Probabilities , 2006 .

[25]  Alex Krizhevsky,et al.  Learning Multiple Layers of Features from Tiny Images , 2009 .

[26]  James Allan,et al.  Interactive Clustering of Text Collections According to a User-Specified Criterion , 2007, IJCAI.