Sparse maximum margin discriminant analysis for feature extraction and gene selection on gene expression data

Dimensionality reduction is necessary for gene expression data classification. In this paper, we propose a new method for reducing the dimensionality of gene expression data. First, based on sparse representation, we develop a new criterion for characterizing the margin, called sparse maximum margin discriminant analysis (SMMDA), which finds an optimal transform matrix such that the sparse margin is maximal in the transformed space. Second, using SMMDA, we present a new feature extraction method for gene expression data. Third, based on SMMDA, we propose a new discriminant gene selection method. For gene selection, we first find the one-dimensional projection of the gene expression data onto the most separable direction using SMMDA. We then apply the sparse representation technique to regress the projection, which yields a relevance vector for the gene set. Discriminant genes are selected according to this vector. Compared with conventional maximum margin discriminant analysis, the proposed SMMDA method avoids the difficulty of parameter selection. Extensive experiments on publicly available gene expression datasets show that SMMDA is efficient for feature extraction and gene selection.
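The gene selection steps described above (project, regress sparsely, rank genes by the relevance vector) can be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: the abstract does not give the SMMDA optimization, so the projection direction here is a stand-in (the normalized difference of the two class means), and the sparse regression is assumed to be an l1-penalized least-squares problem solved by iterative soft thresholding. All function names, the `lam_frac` parameter, and the synthetic data are hypothetical.

```python
import numpy as np

def projection_direction(X, labels):
    """Stand-in for SMMDA (assumption): unit vector along the class-mean difference."""
    d = X[labels == 1].mean(axis=0) - X[labels == 0].mean(axis=0)
    return d / np.linalg.norm(d)

def soft_threshold(v, t):
    """Shrink each entry of v towards zero by t."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(X, y, lam, n_iter=500):
    """Minimise 0.5*||X b - y||^2 + lam*||b||_1 by iterative soft thresholding."""
    step = 1.0 / (np.linalg.norm(X, 2) ** 2)  # 1 / Lipschitz constant of the gradient
    b = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = X.T @ (X @ b - y)
        b = soft_threshold(b - step * grad, step * lam)
    return b

def select_genes(X, labels, lam_frac=0.5, k=5):
    """Return the k genes with the largest absolute relevance, plus the relevance vector."""
    Xc = X - X.mean(axis=0)                # centre each gene
    w = projection_direction(Xc, labels)   # most separable direction (stand-in)
    y = Xc @ w                             # one-dimensional projection of the samples
    lam = lam_frac * np.max(np.abs(Xc.T @ y))  # penalty as a fraction of its largest useful value
    relevance = lasso_ista(Xc, y, lam)     # sparse regression of the projection on the genes
    order = np.argsort(-np.abs(relevance))
    return order[:k], relevance

# Synthetic data (hypothetical): 60 samples, 100 genes, genes 0-4 differ between classes.
rng = np.random.default_rng(0)
labels = np.repeat([0, 1], 30)
X = rng.normal(size=(60, 100))
X[labels == 1, :5] += 4.0

selected, relevance = select_genes(X, labels, lam_frac=0.5, k=5)
print("top-ranked genes:", selected)
print("nonzero relevance entries:", np.flatnonzero(relevance))
```

The penalty is expressed as a fraction of the smallest value that would zero out every coefficient, which keeps the sketch independent of the data scale. A larger `lam_frac` gives a sparser relevance vector and therefore fewer candidate genes.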
