A Review of Feature Selection and Feature Extraction Methods Applied on Microarray Data

We summarise various ways of performing dimensionality reduction on high-dimensional microarray data. Many feature selection and feature extraction methods exist and are widely used. All of them aim to remove redundant and irrelevant features so that new instances can be classified more accurately. Microarrays, a biological platform for measuring gene expression, are a popular source of such data. Analysing them can be difficult because of the size of the data they produce. In addition, the complicated relations among genes make analysis harder, so removing excess features can improve the quality of the results. We present some of the most popular methods for selecting significant features and compare them. Their advantages and disadvantages are outlined to give a clearer idea of when to use each method to save computational time and resources.
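To make the distinction between the two families concrete, the following is a minimal sketch on synthetic data, not a method taken from the review itself. It assumes a two-class problem, uses a Welch t-statistic filter as one possible feature selection rule, and uses PCA (computed via SVD) as one possible feature extraction method. The function names, the synthetic matrix and the choice of three informative "genes" are all hypothetical and exist only for illustration.

```python
import numpy as np


def select_top_k_by_t_statistic(X, y, k):
    """Feature selection (filter style): keep k original columns.

    Ranks each column by the absolute Welch t-statistic between the two
    classes (labels 0 and 1) and returns the sorted indices of the top k.
    """
    a, b = X[y == 0], X[y == 1]
    mean_diff = a.mean(axis=0) - b.mean(axis=0)
    std_err = np.sqrt(a.var(axis=0, ddof=1) / len(a) + b.var(axis=0, ddof=1) / len(b))
    t = np.abs(mean_diff) / (std_err + 1e-12)  # small constant avoids division by zero
    top = np.argsort(t)[::-1][:k]
    return np.sort(top)


def extract_pca(X, n_components):
    """Feature extraction: build new features as linear combinations of all columns.

    Centres the data and projects it onto the leading right singular vectors.
    """
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T


# Synthetic stand-in for a microarray: few samples, many "genes" (assumed sizes).
rng = np.random.default_rng(0)
n_samples, n_genes = 40, 500
y = np.repeat([0, 1], n_samples // 2)
X = rng.normal(size=(n_samples, n_genes))

# Hypothetical informative genes: shifted upwards in class 1 only.
informative = np.array([3, 42, 137])
X[np.ix_(y == 1, informative)] += 4.0

selected = select_top_k_by_t_statistic(X, y, k=3)  # indices of original genes
X_selected = X[:, selected]                        # subset of original columns
X_extracted = extract_pca(X, n_components=2)       # new, derived columns

print("selected gene indices:", selected)
print("selected shape:", X_selected.shape, "extracted shape:", X_extracted.shape)
```

Selection keeps a subset of the original genes, so the result remains directly interpretable, whereas extraction produces new features that mix all genes and are harder to map back to individual ones.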
