Bayesian inference with historical data-based informative priors improves detection of differentially expressed genes

MOTIVATION Modern high-throughput biotechnologies such as microarrays can produce a massive amount of information for each sample. However, a typical high-throughput experiment assays only a limited number of samples, which gives rise to the classical 'large p, small n' problem. At the same time, the rapid spread of these technologies has produced a substantial collection of data, often generated on the same platform with the same protocol. It is highly desirable to make use of these existing data when performing analysis and inference on a new dataset.

RESULTS Existing data can be used in a straightforward fashion under the Bayesian framework, in which the repository of historical data is exploited to build informative priors that are then applied in the analysis of new data. In this work, using microarray data, we investigate the feasibility and effectiveness of deriving informative priors from historical data and using them to detect differentially expressed genes. Through simulation and real data analysis, we show that the proposed strategy significantly outperforms existing methods, including the popular and state-of-the-art Bayesian hierarchical model-based approaches. Our work illustrates the feasibility and benefits of exploiting the increasingly available genomics big data in statistical inference, and presents a promising practical strategy for dealing with the 'large p, small n' problem.

AVAILABILITY AND IMPLEMENTATION Our method is implemented in the R package IPBT, which is freely available from https://github.com/benliemory/IPBT

CONTACT yuzhu@purdue.edu; zhaohui.qin@emory.edu

SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
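
The idea described under RESULTS, building a gene-specific prior from historical data and combining it with a small new experiment, can be illustrated with a minimal Python sketch. This is not the IPBT implementation and does not reproduce its model. It assumes a simple conjugate setup in which each gene's variance has a scaled inverse chi-square prior centred on the variance observed in historical arrays. The function name `moderated_t`, the parameter `prior_df` (how many degrees of freedom of weight the historical variance receives), and the simulated data are all hypothetical.

```python
import numpy as np

def moderated_t(group_a, group_b, hist_var, prior_df):
    """Per-gene t-like statistic with a gene-specific variance prior.

    group_a, group_b : arrays of shape (genes, samples), new experiment
                       (each group needs at least 2 samples)
    hist_var         : per-gene variance estimated from historical data
    prior_df         : hypothetical weight (degrees of freedom) given to
                       hist_var relative to the new data
    """
    n_a, n_b = group_a.shape[1], group_b.shape[1]
    df = n_a + n_b - 2
    # Pooled within-group sample variance from the new experiment.
    pooled = ((n_a - 1) * group_a.var(axis=1, ddof=1)
              + (n_b - 1) * group_b.var(axis=1, ddof=1)) / df
    # Degrees-of-freedom-weighted average of the historical (prior)
    # variance and the pooled variance from the new data.
    post_var = (prior_df * hist_var + df * pooled) / (prior_df + df)
    diff = group_a.mean(axis=1) - group_b.mean(axis=1)
    return diff / np.sqrt(post_var * (1.0 / n_a + 1.0 / n_b))

# Simulated illustration: 5 genes, 50 historical arrays, 3 vs 3 new arrays.
rng = np.random.default_rng(0)
n_genes = 5
true_sd = np.linspace(0.5, 2.0, n_genes)
historical = rng.normal(0.0, true_sd[:, None], size=(n_genes, 50))
hist_var = historical.var(axis=1, ddof=1)

control = rng.normal(0.0, true_sd[:, None], size=(n_genes, 3))
treated = rng.normal(0.0, true_sd[:, None], size=(n_genes, 3))
treated[0] += 3.0  # make the first gene differentially expressed

scores = moderated_t(treated, control, hist_var, prior_df=10.0)
print(scores)
```

Setting `prior_df` to 0 recovers the ordinary pooled-variance t-statistic, while larger values pull each gene's variance estimate toward its historical value, which is where the stabilising effect in the small-sample setting comes from.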
