A phase diagram for gene selection and disease classification

Abstract Identifying a small subset of genes that can distinguish disease samples from healthy controls plays an important role in evaluating disease risk and facilitating diagnosis. Existing methods often provide only a single metric to assess the predictive performance of genes. Moreover, model-based gene importance is conditioned on the subset of genes used to build a multivariate model and is thus model- and context-specific, an effect that existing methods often do not take into account. Here we present a novel gene selection approach that evaluates the predictive performance of genes using two criteria that take gene interactions into account, and projects the genes onto four different regions of a two-dimensional plot, analogous to a phase diagram (PHADIA) in chemistry. Using two publicly available microarray datasets, we show that PHADIA achieves classification accuracies comparable to or better than results reported in the literature. The source code is freely available at www.libpls.net.
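
The abstract does not specify the two criteria or how the four regions are defined, so the following is only a minimal illustrative sketch of the general idea, not the PHADIA implementation. It assumes, hypothetically, that one criterion is a gene's standalone accuracy above chance and the other is the change in accuracy when the gene is included in randomly drawn multi-gene sub-models. The nearest-centroid classifier, the synthetic data, the function names, and the quadrant labels Q1 to Q4 are all assumptions made for illustration.

```python
import random
from statistics import mean

def centroid_accuracy(X_train, y_train, X_test, y_test, genes):
    """Accuracy of a nearest-centroid classifier restricted to `genes`."""
    cents = {}
    for label in (0, 1):
        rows = [x for x, y in zip(X_train, y_train) if y == label]
        cents[label] = [mean(r[g] for r in rows) for g in genes]
    hits = 0
    for x, y in zip(X_test, y_test):
        dist = {lab: sum((x[g] - c) ** 2 for g, c in zip(genes, cen))
                for lab, cen in cents.items()}
        hits += (min(dist, key=dist.get) == y)
    return hits / len(y_test)

def two_criteria(X_train, y_train, X_test, y_test,
                 n_models=200, subset_size=3, seed=0):
    """Return {gene: (solo, context)} under the assumed criteria.

    solo    = standalone accuracy minus 0.5 (chance level, balanced classes)
    context = mean accuracy of random sub-models that include the gene
              minus mean accuracy of those that exclude it
    """
    rng = random.Random(seed)
    p = len(X_train[0])
    with_gene = {g: [] for g in range(p)}
    without_gene = {g: [] for g in range(p)}
    for _ in range(n_models):
        subset = rng.sample(range(p), subset_size)
        acc = centroid_accuracy(X_train, y_train, X_test, y_test, subset)
        for g in range(p):
            (with_gene if g in subset else without_gene)[g].append(acc)
    scores = {}
    for g in range(p):
        solo = centroid_accuracy(X_train, y_train, X_test, y_test, [g]) - 0.5
        if with_gene[g] and without_gene[g]:
            context = mean(with_gene[g]) - mean(without_gene[g])
        else:
            context = 0.0  # gene was never (or always) sampled
        scores[g] = (solo, context)
    return scores

def region(solo, context):
    """Map the two scores to one of four quadrants (hypothetical labels)."""
    if context > 0:
        return "Q1" if solo > 0 else "Q2"
    return "Q4" if solo > 0 else "Q3"

def make_data(n=40, p=8, seed=1):
    """Synthetic data: genes 0 and 1 are shifted in class 1, the rest are noise."""
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n):
        label = i % 2
        X.append([rng.gauss(0, 1) + (1.5 * label if g < 2 else 0.0)
                  for g in range(p)])
        y.append(label)
    return X, y

X, y = make_data()
half = len(X) // 2
scores = two_criteria(X[:half], y[:half], X[half:], y[half:])
regions = {g: region(*s) for g, s in scores.items()}
for g in sorted(regions):
    print(g, regions[g], scores[g])
```

Plotting `solo` against `context` for every gene would give a two-dimensional map of the kind the abstract describes. The actual criteria and region boundaries should be taken from the paper or from the code at www.libpls.net.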
