Improving Hierarchical Models Using Historical Data with Applications in High-Throughput Genomics Data Analysis

Modern high-throughput biotechnologies such as microarrays and next-generation sequencing produce a massive amount of information for each sample assayed. However, in a typical high-throughput experiment, only a limited amount of data is observed for each individual feature, giving rise to the classical “large p, small n” problem. Bayesian hierarchical models, which borrow strength across features within the same dataset, have been recognized as an effective tool for analyzing such data. However, the shrinkage effect, the most prominent characteristic of hierarchical models, can lead to undesirable over-correction for some features. In this work, we discuss possible causes of the over-correction problem and propose several alternative solutions. Our strategy is rooted in the fact that, in the Big Data era, large amounts of historical data are available and should be taken advantage of. It provides a new framework for enhancing the Bayesian hierarchical model. Through simulation and real data analysis, we demonstrate superior performance of the proposed strategy. The strategy also enables borrowing information across different platforms, which could be extremely useful given the emergence of new technologies and the accumulation of data from different platforms. Our method has been implemented in the R package “adaptiveHM,” which is freely available from https://github.com/benliemory/adaptiveHM.
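To make the shrinkage effect concrete, the following is a minimal sketch of empirical-Bayes shrinkage under a simple normal-normal hierarchical model. It is not the adaptiveHM method: the function name `shrink_means`, the assumption of a known and common sampling variance, the method-of-moments variance estimate, and the toy numbers are all assumptions made for illustration. Each per-feature mean is pulled toward the mean across features, and a feature whose true value is far from the rest is pulled the farthest, which is the kind of over-correction discussed above.

```python
from statistics import mean, variance


def shrink_means(feature_means, sampling_var):
    """Shrink per-feature means toward their grand mean.

    Illustrative model (an assumption, not taken from the paper):
        theta_i ~ N(mu, tau^2)              true per-feature effect
        m_i | theta_i ~ N(theta_i, s^2)     observed per-feature mean
    where s^2 = sampling_var is treated as known and shared by all features.
    """
    if sampling_var <= 0:
        raise ValueError("sampling_var must be positive")

    grand_mean = mean(feature_means)
    # Method-of-moments estimate of the between-feature variance tau^2,
    # floored at zero because the difference can be negative by chance.
    tau2 = max(0.0, variance(feature_means) - sampling_var)
    # Weight placed on the grand mean; values near 1 mean heavy shrinkage.
    weight = sampling_var / (sampling_var + tau2)
    shrunk = [weight * grand_mean + (1.0 - weight) * m for m in feature_means]
    return shrunk, weight


# Toy data: five features near zero and one feature with a large effect.
observed = [0.1, -0.2, 0.05, -0.1, 0.15, 3.0]
shrunk, weight = shrink_means(observed, sampling_var=0.5)

print(f"weight on grand mean: {weight:.3f}")
for raw, post in zip(observed, shrunk):
    print(f"observed {raw:6.2f} -> shrunk {post:6.2f}")
```

In this sketch the prior is estimated only from the current dataset, so the one large feature is shrunk by the largest absolute amount. A prior informed by historical data for that same feature would be one way to limit this, which is the motivation stated in the abstract.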
