Hybrid Bayesian-rank integration approach improves the predictive power of genomic dataset aggregation

MOTIVATION Modern molecular technologies allow the collection of large amounts of high-throughput data on the functional attributes of genes. Often multiple technologies and study designs are used to address the same biological question, such as which genes are overexpressed in a specific disease state. Consequently, there is considerable interest in methods that can integrate across datasets to present a unified set of predictions.

RESULTS An important aspect of data integration is accounting for the fact that datasets may differ in how accurately they capture the biological signal of interest. While many methods exist to address this problem, they often rely on either dataset-internal statistics, which reflect data structure and not necessarily biological relevance, or external gold standards, which may not always be available. We present a new rank aggregation method for data integration that requires neither external standards nor internal statistics but relies on Bayesian reasoning to assess dataset relevance. We demonstrate that our method outperforms established techniques and significantly improves the predictive power of rank-based aggregations. We show that our method provides reliable estimates of dataset relevance and allows the same set of data to be integrated differently depending on the specific signal of interest.

AVAILABILITY The method is implemented in R and is freely available at http://www.pitt.edu/~mchikina/BIRRA/.

SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
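The idea of assessing dataset relevance by Bayesian reasoning, without a gold standard, can be illustrated with a small sketch. This is not the published R implementation; it is a simplified illustration under stated assumptions. The function names `bayesian_rank_aggregate` and `simulate`, the rank binning, the prior fraction of positives, the pseudocounts, and the fixed iteration count are all hypothetical choices made for this example. The sketch starts from a mean-rank aggregate, treats its top genes as provisional positives, estimates for each dataset how strongly each rank bin is enriched for those positives, and re-ranks genes by the summed log Bayes factors.

```python
import numpy as np


def bayesian_rank_aggregate(scores, prior=0.05, n_bins=20, n_iter=10):
    """Illustrative iterative rank aggregation (assumed scheme, not the authors' code).

    scores: (n_genes, n_datasets) array, higher = stronger evidence.
    Returns (aggregate score per gene, top-bin log Bayes factor per dataset).
    """
    n_genes, n_sets = scores.shape
    # Rank genes within each dataset; rank 0 is the best gene.
    ranks = np.argsort(np.argsort(-scores, axis=0), axis=0)
    # Map every rank to one of n_bins equal-width rank bins (bin 0 = top).
    bins = (ranks * n_bins) // n_genes
    # Assumed prior: this many genes are treated as true positives.
    n_pos = max(1, int(round(prior * n_genes)))
    # Starting point: plain mean-rank aggregation (negated so higher is better).
    agg = -ranks.mean(axis=1)
    relevance = np.zeros(n_sets)
    for _ in range(n_iter):
        # Provisional positives are the current top n_pos genes.
        is_pos = np.zeros(n_genes, dtype=bool)
        is_pos[np.argsort(-agg)[:n_pos]] = True
        log_bf = np.empty((n_genes, n_sets))
        for j in range(n_sets):
            # Bin occupancy of positives and negatives, with a pseudocount of 1.
            pos = np.bincount(bins[is_pos, j], minlength=n_bins) + 1.0
            neg = np.bincount(bins[~is_pos, j], minlength=n_bins) + 1.0
            # Log Bayes factor of each bin: P(bin | positive) / P(bin | negative).
            bin_log_bf = np.log(pos / pos.sum()) - np.log(neg / neg.sum())
            log_bf[:, j] = bin_log_bf[bins[:, j]]
            # Use the top-bin enrichment as a crude per-dataset relevance score.
            relevance[j] = bin_log_bf[0]
        # Naive-Bayes style combination: sum log Bayes factors across datasets.
        agg = log_bf.sum(axis=1)
    return agg, relevance


def simulate(n_genes=1000, n_true=50, n_informative=3, n_noise=3,
             shift=3.0, seed=0):
    """Synthetic data: informative datasets shift true genes, noise ones do not."""
    rng = np.random.default_rng(seed)
    truth = np.zeros(n_genes, dtype=bool)
    truth[:n_true] = True
    scores = rng.normal(size=(n_genes, n_informative + n_noise))
    scores[truth, :n_informative] += shift
    return scores, truth
```

Calling `scores, truth = simulate()` followed by `agg, relevance = bayesian_rank_aggregate(scores)` gives one aggregate score per gene and one relevance value per dataset. In this toy setting, the datasets that carry no signal should receive lower relevance than the informative ones, even though no gold standard is supplied.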
