ExKMC: Expanding Explainable k-Means Clustering

Despite the popularity of explainable AI, there is limited work on effective methods for unsupervised learning. We study algorithms for $k$-means clustering, focusing on the trade-off between explainability and accuracy. Following prior work, we use a small decision tree to partition a dataset into $k$ clusters. This enables us to explain each cluster assignment by a short sequence of single-feature thresholds. While larger trees produce more accurate clusterings, they also require more complex explanations. To allow flexibility, we develop a new explainable $k$-means clustering algorithm, ExKMC, that takes an additional parameter $k' \geq k$ and outputs a decision tree with $k'$ leaves. We use a new surrogate cost to efficiently expand the tree and to label the leaves with one of $k$ clusters. We prove that as $k'$ increases, the surrogate cost is non-increasing, and hence we trade explainability for accuracy. Empirically, we validate that ExKMC produces a low-cost clustering, outperforming both standard decision tree methods and other algorithms for explainable clustering. An implementation of ExKMC is available at this https URL.
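The core mechanism described above can be illustrated with a small sketch. The following is a hypothetical, simplified illustration (not the authors' implementation): given a set of reference centers (e.g., from a $k$-means run), it searches every single-feature threshold cut and picks the one that minimizes a surrogate cost, namely the sum of squared distances from each point to the single center assigned to its leaf. All function names and the toy data here are invented for illustration.

```python
# Hypothetical sketch of the threshold-tree idea: split on one feature at a
# threshold, assign each resulting leaf to its best single center, and pick
# the cut minimizing the surrogate cost. Names and data are illustrative only.

def sq_dist(p, c):
    return sum((pi - ci) ** 2 for pi, ci in zip(p, c))

def leaf_cost(points, centers):
    """Surrogate cost of a leaf: assign the whole leaf to one best center."""
    if not points:
        return 0.0, None
    costs = [sum(sq_dist(p, c) for p in points) for c in centers]
    best = min(range(len(centers)), key=costs.__getitem__)
    return costs[best], best

def best_split(points, centers):
    """Exhaustively try every (feature, threshold) cut; return the cheapest."""
    dims = len(points[0])
    best = None
    for f in range(dims):
        for t in sorted({p[f] for p in points}):
            left = [p for p in points if p[f] <= t]
            right = [p for p in points if p[f] > t]
            cost = leaf_cost(left, centers)[0] + leaf_cost(right, centers)[0]
            if best is None or cost < best[0]:
                best = (cost, f, t)
    return best

# Toy data: two well-separated groups along the first feature.
data = [(0.0, 0.0), (1.0, 0.2), (0.5, -0.1),
        (9.0, 0.1), (10.0, -0.2), (9.5, 0.0)]
centers = [(0.5, 0.0), (9.5, 0.0)]  # e.g., output of a k-means run

cost, feature, threshold = best_split(data, centers)
print(f"split on feature {feature} at threshold {threshold}, cost {cost:.2f}")
```

Growing a tree with $k' > k$ leaves would repeat this split search on the leaf whose split most reduces the surrogate cost, with each leaf labeled by one of the $k$ centers; this toy version stops after a single cut.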
