Fairness, Semi-Supervised Learning, and More: A General Framework for Clustering with Stochastic Pairwise Constraints

Metric clustering is fundamental in areas ranging from Combinatorial Optimization and Data Mining to Machine Learning and Operations Research. However, in many situations we have additional requirements or knowledge, distinct from the underlying metric, about which pairs of points should be clustered together. To capture and analyze such scenarios, we introduce a novel family of stochastic pairwise constraints, which we incorporate into several essential clustering objectives (radius/median/means). Moreover, we demonstrate that these constraints can succinctly model a collection of applications, including, among others, Individual Fairness in clustering and Must-link constraints in semi-supervised learning. Our main result is a general framework that yields approximation algorithms with provable guarantees for important clustering objectives, while producing solutions that respect the stochastic pairwise constraints. Furthermore, for certain objectives we devise improved results in the case of Must-link constraints, which are also the best possible from a theoretical perspective. Finally, we present experimental evidence that validates the effectiveness of our algorithms.
