Graphical-model Based Multiple Testing under Dependence, with Applications to Genome-wide Association Studies

Large-scale multiple testing tasks often exhibit dependence, and leveraging the dependence between individual tests remains a challenging and important problem in statistics. Recent advances in graphical models make it feasible to use them for multiple testing under dependence. We propose a multiple testing procedure based on a Markov-random-field-coupled mixture model. The ground truth of the hypotheses is represented by a latent binary Markov random field, and the observed test statistics appear as the coupled mixture variables. The parameters of our model can be learned automatically by a novel EM algorithm. We use an MCMC algorithm to infer the posterior probability that each hypothesis is null (termed the local index of significance), and the false discovery rate can be controlled accordingly. Simulations show that our procedure can substantially improve the numerical performance of multiple testing. We apply the procedure to a real-world genome-wide association study on breast cancer and identify several SNPs with strong association evidence.
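The abstract does not spell out how the false discovery rate is controlled once the local indices of significance (LIS) are available. A minimal sketch, assuming the usual step-up rule from the LIS literature: sort the hypotheses by LIS and reject the largest set whose average LIS stays at or below the target level. The function name `lis_step_up`, the `alpha` default, and the input values are hypothetical; the LIS values here are placeholders for the MCMC posterior estimates, which this sketch does not compute.

```python
import numpy as np

def lis_step_up(lis, alpha=0.05):
    """Return a boolean rejection mask from posterior null probabilities.

    Assumed rule (not stated in the abstract): reject the k hypotheses with
    the smallest LIS, where k is the largest value such that the mean of
    those k LIS values is at most alpha.
    """
    lis = np.asarray(lis, dtype=float)
    order = np.argsort(lis, kind="stable")      # most significant first
    # Running mean of the sorted LIS values estimates the FDR of rejecting
    # the first 1, 2, ..., m hypotheses.
    running_mean = np.cumsum(lis[order]) / np.arange(1, lis.size + 1)
    passing = np.nonzero(running_mean <= alpha)[0]
    reject = np.zeros(lis.size, dtype=bool)
    if passing.size > 0:
        k = passing[-1] + 1                     # number of rejections
        reject[order[:k]] = True
    return reject

# Placeholder LIS values, standing in for MCMC posterior estimates.
example_lis = [0.01, 0.9, 0.02, 0.2, 0.5]
print(lis_step_up(example_lis, alpha=0.1))
```

Because the sorted LIS values are nondecreasing, their running mean is nondecreasing too, so taking the last index that passes the threshold gives the largest admissible rejection set.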
