A Bayesian framework for combining heterogeneous data sources for gene function prediction (in Saccharomyces cerevisiae)

Genomic sequencing is no longer a novelty, but gene function annotation remains a key challenge in modern biology. A variety of functional genomics experimental techniques are available, from classic methods such as affinity precipitation to advanced high-throughput techniques such as gene expression microarrays. As more disparate methods are developed, the need for integrated computational analysis of the data these studies generate will continue to grow. We address this problem with MAGIC (Multisource Association of Genes by Integration of Clusters), a general framework that uses formal Bayesian reasoning to integrate heterogeneous types of high-throughput biological data (such as large-scale two-hybrid screens and multiple microarray analyses) for accurate gene function prediction. The system formally incorporates expert knowledge about the relative accuracies of data sources to combine them within a normative framework. MAGIC reports a belief level with each prediction, which allows the user to vary the stringency of predictions. We applied MAGIC to Saccharomyces cerevisiae genetic and physical interaction, microarray, and transcription factor binding site data, and assessed the biological relevance of the resulting gene groupings using Gene Ontology annotations produced by the Saccharomyces Genome Database. We found that, by creating functional groupings based on heterogeneous data types, MAGIC improved the accuracy of the groupings compared with microarray analysis alone. We describe several of the biological gene groupings identified.
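
To make the idea of combining sources of differing reliability into a single belief level concrete, the following is a minimal sketch, not the network described in the paper. It assumes the sources are conditionally independent given whether two genes are functionally related (a naive Bayes simplification), and every number in it is hypothetical: the source names, the per-source probabilities, and the prior are illustrative stand-ins for the expert-assessed accuracies the abstract mentions.

```python
from math import prod

# Hypothetical reliabilities, standing in for expert-assessed accuracies:
# source -> (P(evidence | genes related), P(evidence | genes unrelated))
SOURCES = {
    "two_hybrid": (0.30, 0.02),
    "coexpression": (0.60, 0.20),
    "shared_tf_site": (0.40, 0.10),
}


def belief_for_pair(observed, prior=0.05):
    """Posterior belief that a gene pair is functionally related.

    `observed` is the set of source names that report evidence for the pair.
    Sources are treated as conditionally independent (an assumption of this
    sketch), and a source that reports nothing counts as negative evidence.
    """
    like_related, like_unrelated = [], []
    for name, (p_rel, p_unrel) in SOURCES.items():
        if name in observed:
            like_related.append(p_rel)
            like_unrelated.append(p_unrel)
        else:
            like_related.append(1.0 - p_rel)
            like_unrelated.append(1.0 - p_unrel)
    numerator = prior * prod(like_related)
    denominator = numerator + (1.0 - prior) * prod(like_unrelated)
    return numerator / denominator


def predict(pairs, threshold):
    """Keep only the gene pairs whose belief reaches the chosen stringency."""
    return {pair for pair, obs in pairs.items()
            if belief_for_pair(obs) >= threshold}


if __name__ == "__main__":
    # Placeholder gene names; the evidence sets are made up for illustration.
    pairs = {
        ("geneA", "geneB"): {"two_hybrid", "coexpression", "shared_tf_site"},
        ("geneA", "geneC"): {"coexpression"},
    }
    for pair, obs in pairs.items():
        print(pair, round(belief_for_pair(obs), 3))
    print(predict(pairs, threshold=0.5))
```

Raising `threshold` trades coverage for confidence, which is the role the belief level plays for the user. A pair supported by several independent sources ends up with a higher belief than a pair supported by co-expression alone, illustrating in miniature why integration can outperform microarray analysis by itself.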
