Local connectivity in centroid clustering

Clustering is a fundamental task in unsupervised learning that aims to group a dataset into clusters of similar objects. There has been recent interest in embedding normative considerations around fairness within clustering formulations. In this paper, we propose 'local connectivity' as a crucial factor in assessing membership desert in centroid clustering. We use local connectivity to refer to the support that the local neighborhood of an object offers for its membership in the cluster in question. We motivate the need to consider the local connectivity of objects in cluster assignment, and provide ways to quantify local connectivity in a given clustering. We then exploit concepts from density-based clustering to devise LOFKM, a clustering method that seeks to deepen local connectivity in clustering outputs while staying within the framework of centroid clustering. Through an empirical evaluation over real-world datasets, we show that LOFKM achieves notable improvements in local connectivity at reasonable costs to clustering quality, demonstrating the effectiveness of the method.
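
To make the notion of neighborhood support concrete, the following is a minimal illustrative sketch and not the quantification or the LOFKM method proposed in the paper. It assumes one simple proxy: for each object, the fraction of its k nearest neighbors that were assigned to the same cluster. The function name `neighborhood_agreement` and the parameter `k` are hypothetical and introduced only for this illustration.

```python
import numpy as np

def neighborhood_agreement(X, labels, k=5):
    """Fraction of each object's k nearest neighbors sharing its cluster label.

    Illustrative proxy only (an assumption, not the paper's measure):
    values near 1 mean the local neighborhood supports the assignment,
    values near 0 mean the object is assigned away from its neighbors.
    """
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    # Pairwise squared Euclidean distances between all objects.
    diffs = X[:, None, :] - X[None, :, :]
    d2 = (diffs ** 2).sum(axis=2)
    # Exclude each object from its own neighborhood.
    np.fill_diagonal(d2, np.inf)
    # Indices of the k closest other objects, per row.
    nn = np.argsort(d2, axis=1)[:, :k]
    # Share of those neighbors carrying the same cluster label.
    return (labels[nn] == labels[:, None]).mean(axis=1)

# Toy data: two tight groups; the third point is deliberately assigned
# to the far cluster, so its neighbors do not support its membership.
points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
assignment = [0, 0, 1, 1, 1, 1]
scores = neighborhood_agreement(points, assignment, k=2)
```

Under this proxy, an object that sits among members of another cluster receives a low score even if its centroid assignment is otherwise valid, which is the kind of situation the abstract's notion of local connectivity is concerned with.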
