A graphical model method for integrating multiple sources of genome-scale data

Abstract: Making effective use of multiple data sources is a major challenge in modern bioinformatics. Genome-wide data such as measures of transcription factor binding, gene expression, and sequence conservation are used to identify binding regions and genes that are important to major biological processes such as development and disease. These heterogeneous data types can be difficult to use together because they differ in biological meaning and statistical distribution, yet each provides valuable information about the processes under study. Here we present methods for integrating multiple data sources to gain a more complete picture of gene regulation and expression. Our goal is to identify genes and cis-regulatory regions that play specific biological roles. We describe a graphical mixture model approach for data integration, examine the effect of using different model topologies, and discuss methods for evaluating the effectiveness of the models. Model fitting is computationally efficient and produces results that have clear biological and statistical interpretations. We use the Hedgehog and Dorsal signaling pathways in Drosophila, which are critical in embryonic development, as examples.
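To make the idea of a mixture model over several data types concrete, the following is a minimal sketch and not the model described in the paper. It assumes the simplest possible topology: one hidden class per gene (target or non-target) and two observed features, a binding score and an expression change, each Gaussian and independent given the class. The function name `fit_two_class_mixture`, the feature layout, the choice of EM for fitting, and the simulated data are all illustrative assumptions.

```python
import math
import random


def normal_pdf(x, mean, sd):
    """Density of a univariate Gaussian."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def fit_two_class_mixture(data, n_iter=100):
    """EM for a 2-class mixture whose features are independent given the class.

    data: list of (binding, expression) pairs, one per gene (hypothetical layout).
    Returns (class weights, per-class per-feature (mean, sd), posteriors).
    """
    n, d = len(data), len(data[0])
    # Crude initialisation: split genes at the median of the first feature.
    cut = sorted(row[0] for row in data)[n // 2]
    post = [[1.0, 0.0] if row[0] < cut else [0.0, 1.0] for row in data]
    weights, params = [0.5, 0.5], None
    for _ in range(n_iter):
        # M-step: weighted mean and standard deviation per class and feature.
        params = []
        for k in range(2):
            total = sum(p[k] for p in post)
            weights[k] = total / n
            feats = []
            for j in range(d):
                mean = sum(p[k] * row[j] for p, row in zip(post, data)) / total
                var = sum(p[k] * (row[j] - mean) ** 2
                          for p, row in zip(post, data)) / total
                feats.append((mean, max(math.sqrt(var), 1e-3)))
            params.append(feats)
        # E-step: class posteriors from the product of the feature densities.
        for i, row in enumerate(data):
            joint = []
            for k in range(2):
                lik = weights[k]
                for j in range(d):
                    lik *= normal_pdf(row[j], *params[k][j])
                joint.append(lik)
            s = sum(joint)
            post[i] = [v / s for v in joint]
    return weights, params, post


# Simulated example (assumed values): 300 background genes, 100 target genes.
rng = random.Random(0)
background = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(300)]
targets = [(rng.gauss(3, 1), rng.gauss(2, 1)) for _ in range(100)]
weights, params, post = fit_two_class_mixture(background + targets)
```

Each row of `post` gives the estimated probability that a gene belongs to each class, which is the kind of output that can be read both statistically (a posterior probability) and biologically (a ranked list of candidate targets). Other topologies, such as letting one feature depend on another, would change the factorisation used in the E-step.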
