Bayesian variable selection for disease classification using gene expression data

MOTIVATION An important application of gene expression microarray data is the classification of samples into categories. Accurate classification depends upon the method used to identify the most relevant genes. Owing to the large number of genes and relatively small sample size, the selection process can be unstable. Modification of existing methods for achieving better analysis of microarray data is needed. RESULTS We propose a Bayesian stochastic variable selection approach for gene selection based on a probit regression model with a generalized singular g-prior distribution for regression coefficients. Using simulation-based Markov chain Monte Carlo methods for simulating parameters from the posterior distribution, an efficient and dependable algorithm is implemented. It is also shown that this algorithm is robust to the choices of initial values, and produces posterior probabilities of related genes for biological interpretation. The performance of the proposed approach is compared with other popular methods in gene selection and classification via the well-known colon cancer and leukemia datasets in microarray literature. AVAILABILITY A free Matlab code to perform gene selection is available at http://www.sta.cuhk.edu.hk/xysong/geneselection/. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.