A Novel Stability Based Feature Selection Framework for k-means Clustering

Stability of a learning algorithm with respect to small input perturbations is an important property, as it implies that the derived models are robust to noisy features and/or data sample fluctuations. In this paper we explore the effect of stability optimization in the standard feature selection process for the continuous (PCA-based) k-means clustering problem. Interestingly, we derive that stability maximization naturally introduces a tradeoff between cluster separation and variance, leading to the selection of features that have a high cluster separation index that is not artificially inflated by the feature's variance. The proposed algorithmic setup is based on a Sparse PCA approach that greedily selects the features that maximize stability. In our study, we also analyze several stability-related properties of Sparse PCA that support it as a viable feature selection mechanism for clustering. The practical relevance of the proposed method is demonstrated in the context of cancer research, where we consider the problem of detecting potential tumor biomarkers using microarray gene expression data. The application of our method to a leukemia dataset shows that the tradeoff between cluster separation and variance leads to the selection of features corresponding to important biomarker genes. Some of these genes have relatively low variance and are not detected without the direct optimization of stability in Sparse PCA based k-means.
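As a rough illustration of the setting described above, the following sketch shows a plain greedy Sparse PCA feature selection followed by a two-cluster PCA-based (continuous) k-means assignment. It is only a baseline under stated assumptions: the greedy criterion here is the leading eigenvalue of the covariance submatrix (a variance-driven objective), not the stability objective of this paper, whose exact form is not given in the abstract. The function names `greedy_sparse_pca_support` and `pca_two_cluster_labels` are hypothetical.

```python
import numpy as np


def greedy_sparse_pca_support(X, n_features):
    """Greedy forward selection of a Sparse PCA support.

    Assumption: the score is the leading eigenvalue of the covariance
    submatrix (explained variance), NOT the paper's stability criterion.
    """
    Xc = X - X.mean(axis=0)
    cov = Xc.T @ Xc / (X.shape[0] - 1)
    selected = []
    remaining = list(range(X.shape[1]))
    for _ in range(n_features):
        best_j, best_val = None, -np.inf
        for j in remaining:
            idx = selected + [j]
            # eigvalsh returns eigenvalues in ascending order
            val = np.linalg.eigvalsh(cov[np.ix_(idx, idx)])[-1]
            if val > best_val:
                best_j, best_val = j, val
        selected.append(best_j)
        remaining.remove(best_j)
    return selected


def pca_two_cluster_labels(X, features):
    """Continuous (PCA-based) 2-means: split samples by the sign of
    their score on the first principal component of the chosen features."""
    Xs = X[:, features]
    Xc = Xs - Xs.mean(axis=0)
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    scores = Xc @ vt[0]
    return (scores > 0).astype(int)
```

Because this baseline rewards variance directly, it would tend to favor high-variance features; the abstract's point is that optimizing stability instead can surface well-separated features with relatively low variance.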
