Explainable k-Means and k-Medians Clustering

Clustering is a popular form of unsupervised learning for geometric data. Unfortunately, many clustering algorithms lead to cluster assignments that are hard to explain, partially because they depend on all the features of the data in a complicated way. To improve interpretability, we consider using a small decision tree to partition a data set into clusters, so that clusters can be characterized in a straightforward manner. We study this problem from a theoretical viewpoint, measuring cluster quality by the $k$-means and $k$-medians objectives: Must there exist a tree-induced clustering whose cost is comparable to that of the best unconstrained clustering, and if so, how can it be found? In terms of negative results, we show, first, that popular top-down decision tree algorithms may lead to clusterings with arbitrarily large cost, and second, that any tree-induced clustering must in general incur an $\Omega(\log k)$ approximation factor compared to the optimal clustering. On the positive side, we design an efficient algorithm that produces explainable clusters using a tree with $k$ leaves. For two means/medians, we show that a single threshold cut suffices to achieve a constant factor approximation, and we give nearly-matching lower bounds. For general $k \geq 2$, our algorithm is an $O(k)$ approximation to the optimal $k$-medians and an $O(k^2)$ approximation to the optimal $k$-means. Prior to our work, no algorithms were known with provable guarantees independent of dimension and input size.
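To make the notion of a tree-induced clustering concrete, here is a minimal sketch of the $k = 2$ case described above: a depth-one decision tree defined by a single axis-aligned threshold cut, chosen by exhaustive search to minimize the 2-means cost. This is an illustrative assumption-laden sketch, not the paper's algorithm; the function names and the brute-force search are my own, and the paper's general-$k$ procedure and its approximation analysis are not reproduced here.

```python
# Illustrative sketch (not the paper's algorithm): find the single
# axis-aligned threshold cut x[j] <= t that minimizes the 2-means cost.
# Such a cut is a depth-1 decision tree with two leaves, i.e. an
# explainable 2-means clustering.
import numpy as np


def two_means_cost(left: np.ndarray, right: np.ndarray) -> float:
    """Sum of squared distances of each side of the cut to its own mean."""
    cost = 0.0
    for part in (left, right):
        if len(part) > 0:
            cost += np.sum((part - part.mean(axis=0)) ** 2)
    return cost


def best_threshold_cut(X: np.ndarray):
    """Return (feature index, threshold, cost) of the cheapest single cut."""
    best = (None, None, np.inf)
    n, d = X.shape
    for j in range(d):
        # Candidate thresholds: midpoints between consecutive distinct values.
        vals = np.unique(X[:, j])
        for t in (vals[:-1] + vals[1:]) / 2:
            mask = X[:, j] <= t
            cost = two_means_cost(X[mask], X[~mask])
            if cost < best[2]:
                best = (j, t, cost)
    return best


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Two well-separated Gaussian blobs in 5 dimensions.
    X = np.vstack([rng.normal(0, 1, (100, 5)), rng.normal(6, 1, (100, 5))])
    j, t, cost = best_threshold_cut(X)
    print(f"cut on feature {j} at threshold {t:.2f}, 2-means cost {cost:.1f}")
```

On data like the synthetic example above, the resulting cluster assignment can be stated in one sentence ("cluster 1 if feature $j$ is at most $t$, else cluster 2"), which is exactly the kind of explanation the tree-based formulation is after; the paper's results bound how much cost such explainability can force you to give up relative to the unconstrained optimum.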
