Evaluation of causal Bayesian network search algorithms using simulated mesotheliomas gene expression data

To understand the physiology of a complex disease such as mesothelioma, it is necessary to learn how the genes involved in developing the disease interact with the environment. Statistical methods that can detect these gene-environment interactions will therefore help scientists detect causal relationships among genes, and the predicted relationships can later be verified through laboratory experiments. In this paper, we develop a causal discovery system that incorporates recent advances in Bayesian network search methods. We introduce a novel search algorithm, the Equivalence Checking Local Implicit latent variable scoring Method with Markov Chain Monte Carlo (EquLIM-MCMC), which extends two existing causal Bayesian network discovery algorithms: EquLIM and the Local Implicit latent variable scoring Method (LIM). Markov Chain Monte Carlo (MCMC) search has been shown to be especially useful for analyzing datasets in which the number of input variables greatly exceeds the number of cases collected (Friedman and Koller 2000; Hageman, Leduc et al. 2011). Gene expression studies increasingly produce such datasets, with the expression levels of thousands of genes (input variables) measured from tens or hundreds of subjects (cases), and datasets collected in gene-environment interaction studies will show similar trends. We use LIM with MCMC (LIM-MCMC) and EquLIM-MCMC to analyze purely observational simulated gene expression datasets. To test the algorithms' ability to detect causal relationships from realistic data, we generate the datasets from a gene regulation pathway model of malignant mesothelioma formation proposed by an expert. Using the area under the receiver operating characteristic curve (AUROC), positive predictive value (PPV), and Shannon entropy as metrics, we show that EquLIM-MCMC exhibits clear advantages over LIM-MCMC in predicting causal relationships. EquLIM-MCMC therefore improves on LIM-MCMC's ability to detect causal relationships in gene networks and gene-environment interactions from presently available observational gene expression data.
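The abstract does not say how the three metrics are computed, so the following Python sketch only illustrates one common way to score a set of predicted causal edges against a known generating model. It assumes that each candidate edge receives a posterior probability from the search and that the true edges are known from the simulation. The function names, the 0.5 threshold, the use of mean binary entropy, and the example numbers are all hypothetical and are not taken from the paper.

```python
import math


def auroc(scores, labels):
    """AUROC as the chance that a random true edge outranks a random false edge.

    Ties between a true and a false edge count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one true edge and one false edge")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def ppv(scores, labels, threshold=0.5):
    """Fraction of edges predicted at or above the threshold that are true."""
    predicted = [y for s, y in zip(scores, labels) if s >= threshold]
    if not predicted:
        return float("nan")  # no edge was predicted, so PPV is undefined
    return sum(predicted) / len(predicted)


def mean_binary_entropy(scores):
    """Average Shannon entropy (in bits) of the per-edge posterior probabilities.

    Lower values mean the search is more decisive about each edge.
    """
    def h(p):
        if p <= 0.0 or p >= 1.0:
            return 0.0
        return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

    return sum(h(p) for p in scores) / len(scores)


# Hypothetical posterior probabilities for six candidate edges,
# with 1 marking an edge present in the generating model.
edge_posteriors = [0.9, 0.8, 0.5, 0.3, 0.2, 0.1]
true_edges = [1, 1, 0, 1, 0, 0]

print(auroc(edge_posteriors, true_edges))
print(ppv(edge_posteriors, true_edges))
print(mean_binary_entropy(edge_posteriors))
```

Under these assumptions, a higher AUROC and PPV together with a lower entropy would indicate sharper and more accurate edge predictions, which is the kind of comparison the abstract reports between EquLIM-MCMC and LIM-MCMC.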

[1] J. Testa, et al. Asbestos, chromosomal deletions, and tumor suppressor gene alterations in human malignant mesothelioma, 1999, Journal of cellular physiology.

[2] Gregory F. Cooper, et al. Causal Discovery from a Mixture of Experimental and Observational Data, 1999, UAI.

[3] Gregory F. Cooper, et al. Discovery of gene-regulation pathways using local causal search, 2002, AMIA.

[4] Marco Grzegorczyk, et al. Modelling non-stationary gene regulatory processes with a non-homogeneous Bayesian network and the allocation sampler, 2008, Bioinform.

[5] Judea Pearl, et al. Probabilistic reasoning in intelligent systems, 1988.

[6] D. Husmeier, et al. Reconstructing Gene Regulatory Networks with Bayesian Networks by Combining Expression Data with Multiple Sources of Prior Knowledge, 2007, Statistical applications in genetics and molecular biology.

[7] Erik M. Brilz, et al. The Five‐Gene‐Network Data Analysis with Local Causal Discovery Algorithm Using Causal Bayesian Networks, 2009, Annals of the New York Academy of Sciences.

[8] Changwon Yoo, et al. The Bayesian method for causal discovery of latent-variable models from a mixture of experimental and observational data, 2012, Comput. Stat. Data Anal.

[9] Ron Korstanje, et al. A Bayesian Framework for Inference of the Genotype–Phenotype Map for Segregating Populations, 2011, Genetics.

[10] Tom Burr, et al. Causation, Prediction, and Search, 2003, Technometrics.

[11] Nir Friedman, et al. Being Bayesian about Network Structure, 2000, UAI.

[12] Gregory F. Cooper, et al. A Bayesian method for the induction of probabilistic networks from data, 1992, Machine Learning.

[13] Gregory F. Cooper, et al. A Bayesian Method for Causal Modeling and Discovery Under Selection, 2000, UAI.

[14] David Maxwell Chickering, et al. Learning Bayesian Networks: The Combination of Knowledge and Statistical Data, 1994, Machine Learning.

[15] Richard Scheines, et al. The Tetrad Project, 1990.

[16] Paul P. Wang, et al. Advances to Bayesian network inference for generating causal networks from observational biological data, 2004, Bioinform.