Socially Fair k-Means Clustering

We show that the popular k-means clustering algorithm (Lloyd's heuristic), used for a variety of scientific data, can produce outcomes that are unfavorable to subgroups of the data (e.g., demographic groups). Such biased clusterings can have deleterious implications for human-centric applications such as resource allocation. We present a fair k-means objective and an algorithm that chooses cluster centers to provide equitable costs for different groups. The algorithm, Fair-Lloyd, is a modification of Lloyd's heuristic for k-means and inherits its simplicity, efficiency, and stability. On benchmark datasets, we find that, compared with standard Lloyd's, Fair-Lloyd exhibits unbiased performance by ensuring that all groups have equal costs in the output k-clustering while incurring a negligible increase in running time, making it a viable fair option wherever k-means is currently used.
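To make the notion of "equitable costs for different groups" concrete, the following minimal sketch computes the average squared distance to the nearest center separately for each group, and then takes the worst (largest) group average as a fairness-oriented cost. The abstract does not spell out the objective, so the max-over-group-averages form, the function names `group_costs` and `fair_cost`, and the toy data are assumptions for illustration only; this is not the Fair-Lloyd algorithm itself.

```python
import numpy as np

def group_costs(points, groups, centers):
    """Average squared distance to the nearest center, per group.

    Assumption: a group's "cost" is the mean k-means cost of its points.
    """
    points = np.asarray(points, dtype=float)
    centers = np.asarray(centers, dtype=float)
    groups = np.asarray(groups)
    # Squared Euclidean distance from every point to every center: shape (n, k).
    d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    nearest = d2.min(axis=1)  # cost of each point under its closest center
    return {g.item(): float(nearest[groups == g].mean()) for g in np.unique(groups)}

def fair_cost(points, groups, centers):
    """Worst per-group average cost (assumed form of a fair objective)."""
    return max(group_costs(points, groups, centers).values())

# Toy 1-clustering: two points from group "a", one from group "b".
points = [[0.0, 0.0], [2.0, 0.0], [10.0, 0.0]]
groups = ["a", "a", "b"]

mean_center = [[4.0, 0.0]]     # the standard k-means center (overall mean)
shifted_center = [[5.0, 0.0]]  # a center moved toward the smaller group

print(group_costs(points, groups, mean_center))
print(group_costs(points, groups, shifted_center))
```

In this toy case the overall mean leaves group "b" with a much higher average cost than group "a", while the shifted center narrows the gap and lowers the worst group cost, which is the kind of trade-off a fair objective is meant to capture.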
