Feature selection for k-means clustering stability: theoretical analysis and an algorithm

Stability of a learning algorithm with respect to small input perturbations is an important property, as it implies that the derived models are robust to noisy features and to fluctuations in the data sample. The qualitative nature of the stability property complicates the development of practical, stability-optimizing data mining algorithms, since several questions naturally arise: how "much" stability is enough, and how can stability be effectively linked to intrinsic data properties? In this work we address these questions and explore the effect of stability maximization in the continuous (PCA-based) k-means clustering problem. Our analysis rests on both mathematical-optimization and statistical arguments that complement each other and allow for a solid interpretation of the algorithm's stability properties. Interestingly, we derive that stability maximization naturally introduces a tradeoff between cluster separation and variance, leading to the selection of features whose high cluster-separation index is not artificially inflated by their variance. The proposed algorithmic setup is based on a Sparse PCA approach that greedily selects the features that maximize stability. We also analyze several stability-related properties of Sparse PCA that promote it as a viable feature selection mechanism for clustering. The practical relevance of the proposed method is demonstrated in the context of cancer research, where we consider the problem of detecting potential tumor biomarkers from microarray gene expression data. Applying our method to a leukemia dataset shows that the tradeoff between cluster separation and variance leads to the selection of features corresponding to important biomarker genes. Some of these genes have relatively low variance and are not detected without the direct optimization of stability in Sparse PCA-based k-means.
Beyond this qualitative evaluation, we have also verified our approach as a feature selection method for k-means clustering on four cancer research datasets. The quantitative empirical results illustrate the practical utility of our framework as a feature selection mechanism for clustering.
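To make the overall pipeline concrete, the following is a minimal sketch of greedy Sparse PCA-style feature selection followed by k-means. It is a simplified surrogate, not the paper's exact stability-optimizing algorithm: the hypothetical helper `greedy_sparse_pca_select` greedily adds the feature that most increases the leading eigenvalue of the covariance submatrix, which mirrors the greedy sparse-PCA structure described above but omits the stability criterion.

```python
import numpy as np
from sklearn.cluster import KMeans

def greedy_sparse_pca_select(X, k):
    """Greedily pick k features that maximize the leading eigenvalue of the
    covariance submatrix (a simple greedy Sparse PCA surrogate)."""
    Xc = X - X.mean(axis=0)
    C = Xc.T @ Xc / (len(X) - 1)          # sample covariance matrix
    selected, remaining = [], list(range(X.shape[1]))
    for _ in range(k):
        best_f, best_val = None, -np.inf
        for f in remaining:
            idx = selected + [f]
            # largest eigenvalue of the covariance restricted to idx
            val = np.linalg.eigvalsh(C[np.ix_(idx, idx)])[-1]
            if val > best_val:
                best_f, best_val = f, val
        selected.append(best_f)
        remaining.remove(best_f)
    return selected

# Toy data: two clusters separated along feature 0; the rest is noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
X[:50, 0] += 4.0                           # cluster separation on feature 0
sel = greedy_sparse_pca_select(X, 2)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X[:, sel])
print(sel)
```

On this toy data the separated feature carries most of the leading-eigenvector mass, so it is selected first; the full method in the paper additionally guards against features whose separation index is inflated purely by high variance.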
