ExKMC: Expanding Explainable k-Means Clustering

Despite the popularity of explainable AI, there is limited work on effective methods for unsupervised learning. We study algorithms for $k$-means clustering, focusing on the trade-off between explainability and accuracy. Following prior work, we use a small decision tree to partition a dataset into $k$ clusters. This enables us to explain each cluster assignment by a short sequence of single-feature thresholds. While larger trees produce more accurate clusterings, they also require more complex explanations. To allow flexibility, we develop a new explainable $k$-means clustering algorithm, ExKMC, that takes an additional parameter $k' \geq k$ and outputs a decision tree with $k'$ leaves. We use a new surrogate cost to efficiently expand the tree and to label the leaves with one of $k$ clusters. We prove that as $k'$ increases, the surrogate cost is non-increasing, and hence we trade explainability for accuracy. Empirically, we validate that ExKMC produces a low-cost clustering, outperforming both standard decision tree methods and other algorithms for explainable clustering. An implementation of ExKMC is available at this https URL.
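The core mechanism described above can be illustrated with a small sketch. The following is a hypothetical, simplified illustration (not the authors' implementation): given a set of reference centers (e.g., from a $k$-means run), it searches every single-feature threshold cut and picks the one that minimizes a surrogate cost, namely the sum of squared distances from each point to the single center assigned to its leaf. All function names and the toy data here are invented for illustration.

```python
# Hypothetical sketch of the threshold-tree idea: split on one feature at a
# threshold, assign each resulting leaf to its best single center, and pick
# the cut minimizing the surrogate cost. Names and data are illustrative only.

def sq_dist(p, c):
    return sum((pi - ci) ** 2 for pi, ci in zip(p, c))

def leaf_cost(points, centers):
    """Surrogate cost of a leaf: assign the whole leaf to one best center."""
    if not points:
        return 0.0, None
    costs = [sum(sq_dist(p, c) for p in points) for c in centers]
    best = min(range(len(centers)), key=costs.__getitem__)
    return costs[best], best

def best_split(points, centers):
    """Exhaustively try every (feature, threshold) cut; return the cheapest."""
    dims = len(points[0])
    best = None
    for f in range(dims):
        for t in sorted({p[f] for p in points}):
            left = [p for p in points if p[f] <= t]
            right = [p for p in points if p[f] > t]
            cost = leaf_cost(left, centers)[0] + leaf_cost(right, centers)[0]
            if best is None or cost < best[0]:
                best = (cost, f, t)
    return best

# Toy data: two well-separated groups along the first feature.
data = [(0.0, 0.0), (1.0, 0.2), (0.5, -0.1),
        (9.0, 0.1), (10.0, -0.2), (9.5, 0.0)]
centers = [(0.5, 0.0), (9.5, 0.0)]  # e.g., output of a k-means run

cost, feature, threshold = best_split(data, centers)
print(f"split on feature {feature} at threshold {threshold}, cost {cost:.2f}")
```

Growing a tree with $k' > k$ leaves would repeat this split search on the leaf whose split most reduces the surrogate cost, with each leaf labeled by one of the $k$ centers; this toy version stops after a single cut.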
