Supervised, Unsupervised, and Semi-Supervised Feature Selection: A Review on Gene Selection

Recently, feature selection and dimensionality reduction have become fundamental tools for many data mining tasks, especially for processing high-dimensional data such as gene expression microarray data. Gene expression microarray data comprise up to hundreds of thousands of features with a relatively small sample size. Because learning algorithms usually do not work well with this kind of data, reducing the data dimensionality becomes a challenge. A large number of gene selection methods have been applied to select a subset of relevant features for model construction and to seek better cancer classification performance. This paper presents the basic taxonomy of feature selection and reviews state-of-the-art gene selection methods by grouping the literature into three categories: supervised, unsupervised, and semi-supervised. A comparison of experimental results on the top five representative gene expression datasets indicates that the classification accuracy of unsupervised and semi-supervised feature selection is competitive with that of supervised feature selection.
