Caring without sharing: Meta-analysis 2.0 for massive genome-wide association studies

Genome-wide association studies (GWAS) have been effective at revealing the genetic architecture of simple traits. Extending this approach to more complex phenotypes has required a massive increase in cohort size. To achieve sufficient power, participants are recruited across multiple collaborating institutions, leaving researchers with two choices: either collect all the raw data at a single institution or rely on meta-analysis to test for association. In this work, we present a third alternative: we implement an entire GWAS workflow (quality control, population structure control, and association testing) in a fully decentralized setting. Our iterative approach (a) does not rely on consolidating the raw data at a single coordination center, and (b) does not hinge on large-sample assumptions at each silo. As we show, our approach overcomes challenges that meta-analyses face when associating rare alleles and when case/control proportions are highly imbalanced at each silo. We demonstrate the feasibility of our method in cohorts ranging in size from 2K (small) to 500K (large), recruited across 2 to 10 collaborating institutions.
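
To make the idea of an iterative, decentralized association test concrete, the following is a minimal sketch and not the method of this work. It assumes a logistic model with an intercept and one genotype column, fitted by Newton steps in which each silo shares only its summed gradient and Hessian. The function names (`local_stats`, `federated_logistic`, `simulate_silos`) and the simulated data are hypothetical and serve illustration only.

```python
import numpy as np


def local_stats(X, y, beta):
    """Run inside one silo: gradient and Hessian of the logistic log-likelihood.

    Only these aggregate quantities leave the silo, never X or y.
    """
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    grad = X.T @ (y - p)
    hess = (X * (p * (1.0 - p))[:, None]).T @ X  # X^T W X with W = diag(p(1-p))
    return grad, hess


def federated_logistic(silos, n_iter=25, tol=1e-10):
    """Coordinator loop (hypothetical): sum per-silo statistics, take Newton steps."""
    beta = np.zeros(silos[0][0].shape[1])
    for _ in range(n_iter):
        stats = [local_stats(X, y, beta) for X, y in silos]
        grad = sum(g for g, _ in stats)
        hess = sum(h for _, h in stats)
        step = np.linalg.solve(hess, grad)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta


def simulate_silos(n_silos=3, n_per_silo=400, seed=0):
    """Simulated data (an assumption): one SNP coded 0/1/2 plus an intercept."""
    rng = np.random.default_rng(seed)
    true_beta = np.array([-1.0, 0.5])
    silos = []
    for _ in range(n_silos):
        geno = rng.binomial(2, 0.3, size=n_per_silo).astype(float)
        X = np.column_stack([np.ones(n_per_silo), geno])
        p = 1.0 / (1.0 + np.exp(-X @ true_beta))
        y = rng.binomial(1, p).astype(float)
        silos.append((X, y))
    return silos


if __name__ == "__main__":
    silos = simulate_silos()
    beta_fed = federated_logistic(silos)
    # Pooled fit on the concatenated data, for comparison only.
    X_all = np.vstack([X for X, _ in silos])
    y_all = np.concatenate([y for _, y in silos])
    beta_pool = federated_logistic([(X_all, y_all)])
    print("decentralized:", beta_fed)
    print("pooled:       ", beta_pool)
```

Because the log-likelihood is a sum over participants, the summed per-silo gradients and Hessians equal their pooled counterparts, so in this sketch the decentralized estimate matches the pooled one up to floating-point error without any silo needing a large sample of its own.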
