Probabilistic Fair Clustering

In clustering problems, a central decision-maker is given a complete metric graph over vertices and must produce a clustering of the vertices that minimizes some objective function. In fair clustering problems, vertices are endowed with a color (e.g., membership in a group), and a valid clustering may also be constrained by the representation of colors within each cluster. Prior work in fair clustering assumes complete knowledge of group membership. In this paper, we generalize prior work by assuming only imperfect knowledge of group membership, given through probabilistic assignments. We present clustering algorithms for this more general setting with guarantees on their approximation ratios. We also address the problem of "metric membership", where different groups have a notion of order and distance. We conduct experiments using our proposed algorithms as well as baselines, both to validate our approach and to surface nuanced concerns that arise when group membership is not known deterministically.
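
To make the probabilistic setting concrete, the sketch below illustrates one natural way to evaluate fairness when only probabilistic color assignments are available: bound the expected fraction of a protected color in each cluster around its population-level proportion. This is a minimal illustration, not the paper's algorithm; the helper `expected_color_fractions` and the thresholds `alpha` and `beta` are hypothetical names introduced here, and the k-means step merely stands in for an arbitrary color-blind clustering.

    # Minimal sketch (assumptions noted above): check expected color
    # representation per cluster under probabilistic group membership.
    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)

    # Toy data: 200 points in the plane; p_color[i] is the probability that
    # point i belongs to the protected group (membership is not known exactly).
    X = rng.normal(size=(200, 2))
    p_color = rng.uniform(size=200)

    # A color-blind clustering used only as a starting point for the check.
    labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

    def expected_color_fractions(labels, p_color):
        """Expected fraction of the protected color in each cluster."""
        fractions = {}
        for c in np.unique(labels):
            members = labels == c
            # E[# protected points in cluster] / cluster size
            fractions[c] = p_color[members].mean()
        return fractions

    # Probabilistic fairness check: each cluster's expected color fraction
    # should lie within [beta, alpha] around the overall proportion.
    overall = p_color.mean()
    alpha, beta = 1.2 * overall, 0.8 * overall  # illustrative 20% slack
    for c, frac in expected_color_fractions(labels, p_color).items():
        print(f"cluster {c}: expected fraction {frac:.2f}, "
              f"fair={'yes' if beta <= frac <= alpha else 'no'}")

In this toy check, clusters whose expected color fraction falls outside the bounds would need to be repaired, which is the role played by the approximation algorithms described above.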
