Fair Clustering Under a Bounded Cost

Clustering is a fundamental unsupervised learning problem in which a dataset is partitioned into clusters of nearby points in a metric space. A recent variant, fair clustering, associates with each point a color representing its group membership and requires that each color have (approximately) equal representation in each cluster to satisfy group fairness. In this model, enforcing fairness increases the cost of the clustering objective. The relative increase in cost, the “price of fairness,” can be unbounded. Therefore, in this paper we propose to treat an upper bound on the clustering objective as a constraint on the clustering problem, and to maximize equality of representation subject to it. We consider two fairness objectives, the group utilitarian objective and the group egalitarian objective, as well as the group leximin objective, which generalizes the group egalitarian objective. We derive fundamental lower bounds on the approximation of the utilitarian and egalitarian objectives and introduce algorithms with provable guarantees for them. For the leximin objective we introduce an effective heuristic algorithm. We further derive impossibility results for other natural fairness objectives. We conclude with experimental results on real-world datasets that demonstrate the validity of our algorithms.
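To make the quantities named in the abstract concrete, the following is a minimal sketch, not the paper's formulation. It assumes that a color's "violation" is the largest gap, over all clusters, between its share of a cluster and its share of the whole dataset; that the utilitarian objective sums these violations over colors; and that the egalitarian objective takes their maximum. The function names (`color_violations`, `utilitarian`, `egalitarian`, `price_of_fairness`) are hypothetical and chosen only for illustration.

```python
from collections import Counter


def color_violations(labels, colors):
    """Assumed violation measure: for each color, the largest absolute gap
    between its share of any cluster and its share of the whole dataset."""
    n = len(colors)
    overall = {h: c / n for h, c in Counter(colors).items()}

    # Group the colors of the points by cluster label.
    clusters = {}
    for lab, col in zip(labels, colors):
        clusters.setdefault(lab, []).append(col)

    violations = {h: 0.0 for h in overall}
    for members in clusters.values():
        counts = Counter(members)
        for h in overall:
            share = counts.get(h, 0) / len(members)
            violations[h] = max(violations[h], abs(share - overall[h]))
    return violations


def utilitarian(violations):
    """Assumed group utilitarian objective: total violation over all colors."""
    return sum(violations.values())


def egalitarian(violations):
    """Assumed group egalitarian objective: violation of the worst-off color."""
    return max(violations.values())


def price_of_fairness(fair_cost, unconstrained_cost):
    """Relative increase in clustering cost caused by enforcing fairness."""
    return fair_cost / unconstrained_cost


# Toy data: six points, two clusters, two colors (values are illustrative only).
labels = [0, 0, 0, 1, 1, 1]
colors = ["red", "red", "blue", "blue", "blue", "red"]
v = color_violations(labels, colors)
print(v, utilitarian(v), egalitarian(v))
```

Under these assumptions, the approach described in the abstract would fix an upper bound on the clustering cost and then search for the clustering that minimizes `utilitarian(v)` or `egalitarian(v)` within that bound, rather than fixing the fairness level and accepting whatever cost results.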
