Gene Prediction Using Multinomial Probit Regression with Bayesian Gene Selection

A critical issue for the construction of genetic regulatory networks is the identification of network topology from data. In the context of deterministic and probabilistic Boolean networks, as well as their extension to multilevel quantization, this issue is related to the more general problem of expression prediction, in which we want to find small subsets of genes to be used as predictors of target genes. Given some maximum number of predictors to be used, a full search of all possible predictor sets is combinatorially prohibitive except for small predictor sets, and even then may require supercomputing. Hence, suboptimal approaches to finding predictor sets and network topologies are desirable. This paper considers Bayesian variable selection for prediction using a multinomial probit regression model, with data augmentation to turn the multinomial problem into a sequence of smoothing problems. Because there are multiple regression equations, we select the same strongest genes for all of them; these genes constitute the predictor set for the target or, in the context of a genetic network, the dependency set for the target. The probit regressor is approximated as a linear combination of the genes, and a Gibbs sampler is employed to find the strongest genes. Numerical techniques to speed up the computation are discussed. After finding the strongest genes, we predict the target gene from them, with the coefficient of determination used to measure predictor accuracy. Using malignant melanoma microarray data, we compare two predictor models, the estimated probit regressors themselves and the optimal full-logic predictor based on the selected strongest genes, and we compare both to optimal prediction without feature selection.
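
To make the accuracy measure concrete, the following is a minimal sketch of a coefficient of determination for a full-logic predictor on quantized expression data. It is not the paper's implementation. It assumes the usual form (eps0 - eps) / eps0, where eps0 is the error of the best constant prediction of the target and eps is the error of the best prediction given the selected genes. It also assumes misclassification rate as the error measure and a resubstitution estimate (errors counted on the same samples used to fit the predictor). The function name and the toy data are hypothetical.

```python
from collections import Counter, defaultdict


def coefficient_of_determination(predictors, target):
    """Resubstitution estimate of the CoD for a full-logic predictor.

    predictors: one tuple of quantized expression levels per sample
    target:     one quantized target level per sample
    """
    n = len(target)
    if n == 0 or len(predictors) != n:
        raise ValueError("predictors and target must be non-empty and equal in length")

    # eps0: error of the best constant predictor (always guess the most common level).
    eps0 = (n - max(Counter(target).values())) / n
    if eps0 == 0:
        # Constant target: nothing to improve on. Returning 0 is a convention assumed here.
        return 0.0

    # Full-logic predictor: for each observed predictor pattern,
    # predict the most frequent target level seen with that pattern.
    by_pattern = defaultdict(Counter)
    for x, y in zip(predictors, target):
        by_pattern[tuple(x)][y] += 1

    # Samples that disagree with the majority level of their pattern are errors.
    errors = sum(sum(c.values()) - max(c.values()) for c in by_pattern.values())
    eps = errors / n

    return (eps0 - eps) / eps0


# Toy ternary data (levels -1, 0, 1) for two hypothetical predictor genes.
genes = [(1, 0), (1, 0), (0, 1), (0, 1), (-1, -1), (-1, -1)]
levels = [1, 1, 0, 0, -1, 0]
cod = coefficient_of_determination(genes, levels)
```

On the toy data the constant predictor misclassifies 3 of 6 samples and the full-logic predictor misclassifies 1 of 6, so the estimate is (1/2 - 1/6) / (1/2) = 2/3. A resubstitution estimate is optimistic for small samples, so a held-out or cross-validated error would be a natural substitute in practice.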
