A review of feature selection techniques in bioinformatics

Feature selection techniques have become a clear necessity in many bioinformatics applications. In addition to the large pool of techniques already developed in the machine learning and data mining fields, specific applications in bioinformatics have led to a wealth of newly proposed techniques. In this article, we make the interested reader aware of the possibilities of feature selection by providing a basic taxonomy of feature selection techniques and discussing their use, variety and potential in a number of both common and upcoming bioinformatics applications.
