Stable Feature Selection for Biomarker Discovery

Feature selection techniques have long been the workhorse of biomarker discovery applications. Surprisingly, the stability of feature selection with respect to sampling variations has long been under-considered; only recently has this issue received increasing attention. In this article, we review existing stable feature selection methods for biomarker discovery using a generic hierarchical framework. We have two objectives: (1) to provide an overview of this new yet fast-growing topic as a convenient reference; and (2) to categorize existing methods under an expandable framework for future research and development.

[1]  Chris H. Q. Ding,et al.  Consensus group stable feature selection , 2009, KDD.

[2]  Tao Han,et al.  Cross-platform comparability of microarray technology: Intra-platform consistency and appropriate data analysis procedures are essential , 2005, BMC Bioinformatics.

[3]  E. Dougherty,et al.  Accurate and Reliable Cancer Classification Based on Probabilistic Inference of Pathway Activity , 2009, PloS one.

[4]  Paul Tempst,et al.  Pathway-based biomarker search by high-throughput proteomics profiling of secretomes. , 2009, Journal of proteome research.

[5]  Stephen T. C. Wong,et al.  The knowledge-integrated network biomarkers discovery for major adverse cardiac events. , 2008, Journal of proteome research.

[6]  D. DeMets,et al.  Biomarkers and surrogate endpoints: Preferred definitions and conceptual framework , 2001, Clinical pharmacology and therapeutics.

[7]  Tommy W. S. Chow,et al.  Effective Gene Selection Method With Small Sample Sets Using Gradient-Based and Point Injection Techniques , 2007, IEEE/ACM Transactions on Computational Biology and Bioinformatics.

[8]  Moni Naor,et al.  Rank aggregation methods for the Web , 2001, WWW '01.

[9]  Kevin P. Rosenblatt,et al.  Application of multiple statistical tests to enhance mass spectrometry-based biomarker discovery , 2009, BMC Bioinformatics.

[10]  Marko Grobelnik,et al.  Proceedings of the European Conference on Machine Learning and Knowledge Discovery in Databases: Part II , 2009 .

[11]  Wei Pan,et al.  Incorporating prior knowledge of predictors into penalized classifiers with multiple penalty terms , 2007, Bioinform..

[12]  Bin Yu,et al.  Simultaneous Gene Clustering and Subset Selection for Sample Classification Via MDL , 2003, Bioinform..

[13]  Seon-Young Kim,et al.  Effects of sample size on robustness and prediction accuracy of a prognostic gene signature , 2009, BMC Bioinformatics.

[14]  Bernhard Pfeifer,et al.  A new ensemble-based algorithm for identifying breath gas marker candidates in liver disease using ion molecule reaction mass spectrometry , 2009, Bioinform..

[15]  G. Stolovitzky Gene selection in microarray data: the elephant, the blind men and our algorithms. , 2003, Current opinion in structural biology.

[16]  Sylvia Richardson,et al.  Statistical Applications in Genetics and Molecular Biology Comparing the Characteristics of Gene Expression Profiles Derived by Univariate and Multivariate Classification Methods , 2011 .

[17]  Emmanuel Barillot,et al.  Classification of microarray data using gene networks , 2007, BMC Bioinformatics.

[18]  Wenfei Fan,et al.  Keys for XML , 2001, WWW '01.

[19]  James J. Chen,et al.  Development of biomarker classifiers from high-dimensional data , 2009, Briefings Bioinform..

[20]  H. Zou,et al.  Addendum: Regularization and variable selection via the elastic net , 2005 .

[21]  Jian Huang,et al.  BMC Bioinformatics BioMed Central Methodology article Supervised group Lasso with applications to microarray data , 2007 .

[22]  Qiang Yang,et al.  A Survey on Transfer Learning , 2010, IEEE Transactions on Knowledge and Data Engineering.

[23]  Cesare Furlanello,et al.  Algebraic stability indicators for ranked lists in molecular profiling , 2008, Bioinform..

[24]  Pablo Tamayo,et al.  Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles , 2005, Proceedings of the National Academy of Sciences of the United States of America.

[25]  Vipin Kumar,et al.  Robust and efficient identification of biomarkers by classifying features on graphs , 2008, Bioinform..

[26]  Taesung Park,et al.  Identification of differentially expressed subnetworks based on multivariate ANOVA , 2009, BMC Bioinformatics.

[27]  Fabio Roli,et al.  Proceedings of the 2006 joint IAPR international conference on Structural, Syntactic, and Statistical Pattern Recognition , 2002 .

[28]  Jean Yee Hwa Yang,et al.  Gene expression Identifying differentially expressed genes from microarray experiments via statistic synthesis , 2005 .

[29]  Tin Kam Ho,et al.  The Random Subspace Method for Constructing Decision Forests , 1998, IEEE Trans. Pattern Anal. Mach. Intell..

[30]  R. Tibshirani,et al.  Supervised harvesting of expression trees , 2001, Genome Biology.

[31]  Pedro Larrañaga,et al.  A review of feature selection techniques in bioinformatics , 2007, Bioinform..

[32]  Melanie Hilario,et al.  Knowledge and Information Systems , 2007 .

[33]  Jana Novovicová,et al.  Evaluating the Stability of Feature Selectors That Optimize Feature Subset Cardinality , 2008, SSPR/SPR.

[34]  Jason Weston,et al.  Gene Selection for Cancer Classification using Support Vector Machines , 2002, Machine Learning.

[35]  B. Snel,et al.  Comparative assessment of large-scale data sets of protein–protein interactions , 2002, Nature.

[36]  N. Meinshausen,et al.  Stability selection , 2008, 0809.2932.

[37]  Ieee Xplore,et al.  IEEE Transactions on Pattern Analysis and Machine Intelligence Information for Authors , 2022, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[38]  Yoav Freund,et al.  A decision-theoretic generalization of on-line learning and an application to boosting , 1995, EuroCOLT.

[39]  Louise C. Showe,et al.  Recursive Cluster Elimination (RCE) for classification and feature selection from gene expression data , 2007, BMC Bioinformatics.

[40]  Rainer Spang,et al.  Similarities of Ordered Gene Lists , 2006, J. Bioinform. Comput. Biol..

[41]  A.K.C. Wong,et al.  Attribute clustering for grouping, selection, and classification of gene expression data , 2005, IEEE/ACM Transactions on Computational Biology and Bioinformatics.

[42]  Yang Wang,et al.  Attribute Clustering for Grouping, Selection, and Classification of Gene Expression Data , 2005, IEEE ACM Trans. Comput. Biol. Bioinform..

[43]  Jian Huang,et al.  Identification of cancer-associated gene clusters and genes via clustering penalization. , 2009, Statistics and its interface.

[44]  Shuangge Ma,et al.  Identifying subset of genes that have influential impacts on cancer progression: a new approach to analyze cancer microarray data , 2008, Functional & Integrative Genomics.

[45]  Jin-Kao Hao,et al.  Advances in metaheuristics for gene selection and classification of microarray data , 2010, Briefings Bioinform..

[46]  Vladimir Vapnik,et al.  Statistical learning theory , 1998 .

[47]  Hui Xiao,et al.  Evaluating reproducibility of differential expression discoveries in microarray studies by considering correlated molecular changes , 2009, Bioinform..

[48]  Francisco Azuaje,et al.  Computational biology for cardiovascular biomarker discovery , 2009, Briefings Bioinform..

[49]  M. Verma,et al.  Proteomics for cancer biomarker discovery. , 2002, Clinical chemistry.

[50]  Thibault Helleputte,et al.  Robust biomarker identification for cancer diagnosis with ensemble feature selection methods , 2010, Bioinform..

[51]  Melanie Hilario,et al.  Approaches to dimensionality reduction in proteomic biomarker studies , 2007, Briefings Bioinform..

[52]  Francis R. Bach,et al.  Bolasso: model consistent Lasso estimation through the bootstrap , 2008, ICML '08.

[53]  Stefan Michiels,et al.  Prediction of cancer outcome with microarrays: a multiple random validation strategy , 2005, The Lancet.

[54]  Chris H. Q. Ding,et al.  Stable feature selection via dense feature groups , 2008, KDD.

[55]  Jesper Tegnér,et al.  On reliable discovery of molecular signatures , 2009, BMC Bioinformatics.

[56]  Serban Nacu,et al.  Gene expression network analysis and applications to immunology , 2007, Bioinform..

[57]  Michael L. Bittner,et al.  Strong Feature Sets from Small Samples , 2002, J. Comput. Biol..

[58]  Mukesh Verma,et al.  Proteomics for Cancer Biomarker Discovery , 2002 .

[59]  Ludmila I. Kuncheva,et al.  A stability index for feature selection , 2007, Artificial Intelligence and Applications.

[60]  P. Cunningham,et al.  Solutions to Instability Problems with Sequential Wrapper-based Approaches to Feature Selection , 2002 .

[61]  Yanqing Zhang,et al.  Recursive Fuzzy Granulation for Gene Subsets Extraction and Cancer Classification , 2008, IEEE Transactions on Information Technology in Biomedicine.

[62]  Qing Wang,et al.  Towards precise classification of cancers based on robust gene functional expression profiles , 2005, BMC Bioinformatics.

[63]  Xi Chen,et al.  Integrating Biological Knowledge with Gene Expression Profiles for Survival Prediction of Cancer , 2009, J. Comput. Biol..

[64]  Caroline C. Friedel,et al.  Reliable gene signatures for microarray classification: assessment of stability and performance , 2006, Bioinform..

[65]  Xin Yao,et al.  Diversity creation methods: a survey and categorisation , 2004, Inf. Fusion.

[66]  Eytan Domany,et al.  Outcome signature genes in breast cancer: is there a unique set? , 2004, Breast Cancer Research.

[67]  Seon-Young Kim,et al.  Gene-set approach for expression pattern analysis , 2008, Briefings Bioinform..

[68]  Qi Liu,et al.  Gene-set analysis and reduction , 2008, Briefings Bioinform..

[69]  Larry A. Rendell,et al.  A Practical Approach to Feature Selection , 1992, ML.

[70]  T. Chow,et al.  Effective Gene Selection Method With Small Sample Sets Using Gradient-Based and Point Injection Techniques , 2007, TCBB.

[71]  Trevor Hastie,et al.  Averaged gene expressions for regression. , 2007, Biostatistics.

[72]  Leo Breiman,et al.  Bagging Predictors , 1996, Machine Learning.

[73]  Aleix M. Martínez,et al.  Using the information embedded in the testing sample to break the limits caused by the small sample size in microarray-based classification , 2008, BMC Bioinformatics.

[74]  Maria Joseph,et al.  Guilt-by-association feature selection: Identifying biomarkers from proteomic profiles , 2008, J. Biomed. Informatics.

[75]  Jian Huang,et al.  Penalized feature selection and classification in bioinformatics , 2008, Briefings Bioinform..

[76]  R. Tibshirani,et al.  On testing the significance of sets of genes , 2006, math/0610667.

[77]  Jing Zhu,et al.  Apparently low reproducibility of true differential expression discoveries in microarray studies , 2008, Bioinform..

[78]  Anne-Laure Boulesteix,et al.  Stability and aggregation of ranked gene lists , 2009, Briefings Bioinform..

[79]  R. Tibshirani Regression Shrinkage and Selection via the Lasso , 1996 .

[80]  Manfred Jaeger,et al.  Proceedings of the 24th Annual International Conference on Machine Learning (ICML 2007) , 2007, ICML 2007.

[81]  Seon-Young Kim,et al.  Gene-set approach for expression pattern analysis , 2008, Briefings Bioinform..

[82]  Melanie Hilario,et al.  Stability of feature selection algorithms , 2005, Fifth IEEE International Conference on Data Mining (ICDM'05).

[83]  Thibault Helleputte,et al.  Partially supervised feature selection with regularized linear models , 2009, ICML '09.

[84]  Andrew Steele,et al.  A Robust Biomarker , 2000 .

[85]  L. Ein-Dor,et al.  Thousands of samples are needed to generate a robust gene list for predicting outcome in cancer. , 2006, Proceedings of the National Academy of Sciences of the United States of America.

[86]  Jesper Tegnér,et al.  Consistent Feature Selection for Pattern Recognition in Polynomial Time , 2007, J. Mach. Learn. Res..

[87]  Doheon Lee,et al.  Inferring Pathway Activity toward Precise Disease Classification , 2008, PLoS Comput. Biol..

[88]  Jian Li,et al.  Iterative RELIEF for feature weighting , 2006, ICML.

[89]  Anna Gambin,et al.  On consensus biomarker selection , 2007, BMC Bioinformatics.

[90]  H. Zou,et al.  Regularization and variable selection via the elastic net , 2005 .

[91]  Louise C. Showe,et al.  Classification and biomarker identification using gene network modules and support vector machines , 2009, BMC Bioinformatics.

[92]  T. Ideker,et al.  Network-based classification of breast cancer metastasis , 2007, Molecular systems biology.

[93]  Yoav Freund,et al.  A decision-theoretic generalization of on-line learning and an application to boosting , 1997, EuroCOLT.

[94]  Jian Huang,et al.  Clustering threshold gradient descent regularization: with applications to microarray studies , 2007, Bioinform..

[95]  David H. Wolpert,et al.  Stacked generalization , 1992, Neural Networks.

[96]  Thibault Helleputte,et al.  Feature Selection by Transfer Learning with Linear Regularized Models , 2009, ECML/PKDD.