A Bayesian integration model for improved gene functional inference from heterogeneous data sources

High-throughput genomic technologies are making increasing amounts of biological data available from a variety of sources. However, analysis of any single data source cannot fully capture the complexity of hierarchical gene function prediction. Integrating multiple data sources is therefore required to gain a more precise understanding of the role of genes in living organisms. In this paper, we develop HiBiN, a Hierarchical Bayesian iNtegration algorithm: a general framework that uses Bayesian reasoning to integrate heterogeneous data sources for accurate gene function prediction. The system uses posterior probabilities computed from multiple data sources to assign class memberships to samples while maintaining the hierarchical constraint that governs the annotation of genes. We demonstrate that integrating the diverse datasets significantly improves classification quality for hierarchical gene function prediction on several measures, compared with two baselines: single-source prediction models and a fused-flat model.
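
The two ideas named above, fusing per-source posterior probabilities and keeping predictions consistent with the class hierarchy, can be illustrated with a minimal sketch. This is not the HiBiN algorithm itself, whose details are not given here. It assumes that the data sources are conditionally independent given the class (a naive Bayes style fusion) and that the hierarchical constraint means a class can never be more probable than its parent. The function names `fuse_posteriors` and `enforce_hierarchy` and the class labels are hypothetical.

```python
import math


def fuse_posteriors(prior, source_posteriors):
    """Fuse per-source posteriors P(class | source s) for one class.

    Assumption (not from the paper): sources are conditionally independent
    given the class, so evidence adds up in log-odds space.
    All probabilities must lie strictly between 0 and 1.
    """
    def logit(p):
        return math.log(p / (1.0 - p))

    # Each source contributes its log-odds shift relative to the prior.
    z = logit(prior) + sum(logit(p) - logit(prior) for p in source_posteriors)
    return 1.0 / (1.0 + math.exp(-z))


def enforce_hierarchy(probs, parent):
    """Cap every class probability by the probability of its ancestors.

    probs  : dict mapping class label -> fused probability
    parent : dict mapping class label -> parent label (None for a root)
    Assumption: the constraint is "child <= parent", applied top-down.
    """
    constrained = {}

    def resolve(label):
        if label not in constrained:
            p = probs[label]
            if parent.get(label) is not None:
                p = min(p, resolve(parent[label]))
            constrained[label] = p
        return constrained[label]

    for label in probs:
        resolve(label)
    return constrained


# Hypothetical three-level hierarchy with two data sources per class.
parent = {"A": None, "A.1": "A", "A.1.2": "A.1"}
fused = {
    "A": fuse_posteriors(0.5, [0.6, 0.5]),
    "A.1": fuse_posteriors(0.5, [0.8, 0.8]),
    "A.1.2": fuse_posteriors(0.5, [0.4, 0.5]),
}
consistent = enforce_hierarchy(fused, parent)
```

In this example the fused probability of `A.1` exceeds that of its parent `A`, so the constraint step lowers it to the parent's value, while `A.1.2` is already below its ancestors and is left unchanged.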
