Determining the Number of Non-Spurious Arcs in a Learned DAG Model: Investigation of a Bayesian and a Frequentist Approach

In many application domains, such as computational biology, the goal of graphical model structure learning is to uncover discrete relationships between entities. For example, in our problem of interest concerning HIV vaccine design, we want to infer which HIV peptides interact with which immune system molecules (HLA molecules). For problems of this nature, we are interested in determining the number of non-spurious arcs in a learned graphical model. We describe both a Bayesian and frequentist approach to this problem. In the Bayesian approach, we use the posterior distribution over model structures to compute the expected number of true arcs in a learned model. In the frequentist approach, we develop a method based on the concept of the False Discovery Rate. On synthetic data sets generated from models similar to the ones learned, we find that both the Bayesian and frequentist approaches yield accurate estimates of the number of non-spurious arcs. In addition, we speculate that the frequentist approach, which is non-parametric, may outperform the parametric Bayesian approach in situations where the models learned are less representative of the data. Finally, we apply the frequentist approach to our problem of HIV vaccine design.

[1]  Gregory F. Cooper,et al.  The ALARM Monitoring System: A Case Study with two Probabilistic Inference Techniques for Belief Networks , 1989, AIME.

[2]  Y. Benjamini,et al.  Controlling the false discovery rate: a practical and powerful approach to multiple testing , 1995 .

[3]  David Heckerman,et al.  Causal independence for probability assessment and inference using Bayesian networks , 1996, IEEE Trans. Syst. Man Cybern. Part A.

[4]  Nir Friedman,et al.  Data Analysis with Bayesian Networks: A Bootstrap Approach , 1999, UAI.

[5]  Nir Friedman,et al.  On the application of the bootstrap for computing confidence measures on features of induced Bayesian networks , 1999, AISTATS.

[6]  E. Rosenberg,et al.  Rapid Definition of Five Novel HLA-A∗3002-Restricted Human Immunodeficiency Virus-Specific Cytotoxic T-Lymphocyte Epitopes by Elispot and Intracellular Cytokine Staining Assays , 2001, Journal of Virology.

[7]  John D. Storey A direct approach to false discovery rates , 2002 .

[8]  John D. Storey,et al.  Statistical significance for genomewide studies , 2003, Proceedings of the National Academy of Sciences of the United States of America.

[9]  David Maxwell Chickering,et al.  Learning Bayesian Networks: The Combination of Knowledge and Statistical Data , 1994, Machine Learning.

[10]  J. Peña Learning and Validating Bayesian Network Models of Genetic Regulatory Networks , 2004 .

[11]  Nir Friedman,et al.  Being Bayesian About Network Structure. A Bayesian Approach to Structure Discovery in Bayesian Networks , 2004, Machine Learning.

[12]  Bradley Efron,et al.  Local False Discovery Rates , 2005 .

[13]  Korbinian Strimmer,et al.  An empirical Bayes approach to inferring large-scale gene association networks , 2005, Bioinform..

[14]  Tom Heskes,et al.  Symmetric Causal Independence Models for Classification , 2006, Probabilistic Graphical Models.

[15]  Richard E. Neapolitan,et al.  Learning Bayesian networks , 2007, KDD '07.

[16]  Jose M. Peña,et al.  Learning and Validating Bayesian Network Models of Gene Networks , 2007 .