Caring without sharing: Meta-analysis 2.0 for massive genome-wide association studies

Genome-wide association studies have been effective at revealing the genetic architecture of simple traits. Extending this approach to more complex phenotypes has necessitated a massive increase in cohort size. To achieve sufficient power, participants are recruited across multiple collaborating institutions, leaving researchers with two choices: either collect all the raw data at a single institution or rely on meta-analyses to test for association. In this work, we present a third alternative. Here, we implement an entire GWAS workflow (quality control, population structure control, and association) in a fully decentralized setting. Our iterative approach (a) does not rely on consolidating the raw data at a single coordination center, and (b) does not hinge upon large sample size assumptions at each silo. As we show, our approach overcomes challenges faced by meta-studies when it comes to associating rare alleles and when case/control proportions are wildly imbalanced at each silo. We demonstrate the feasibility of our method in cohorts ranging in size from 2K (small) to 500K (large), and recruited across 2 to 10 collaborating institutions.
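The abstract does not spell out the iterative algorithm, so the following is only a minimal sketch of the general idea, under stated assumptions: each silo shares aggregate statistics (here, the gradient and Hessian of its local logistic log-likelihood) instead of raw genotypes, and a coordinator combines them. Newton's method is swapped in purely for illustration and is not claimed to be the authors' method. The names `local_stats` and `federated_logistic`, the simulated data, and the effect sizes are all hypothetical.

```python
import numpy as np


def local_stats(X, y, beta):
    """Run inside one silo: gradient and Hessian of the local log-likelihood.

    Only these aggregates leave the silo; X and y stay local.
    """
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    grad = X.T @ (y - p)
    hess = -(X.T * (p * (1.0 - p))) @ X
    return grad, hess


def federated_logistic(silos, n_iter=25, tol=1e-10):
    """Coordinator loop: sum per-silo aggregates and take a Newton step."""
    beta = np.zeros(silos[0][0].shape[1])
    for _ in range(n_iter):
        stats = [local_stats(X, y, beta) for X, y in silos]
        grad = sum(g for g, _ in stats)
        hess = sum(h for _, h in stats)
        step = np.linalg.solve(hess, grad)
        beta = beta - step
        if np.max(np.abs(step)) < tol:
            break
    return beta


# Simulated example (hypothetical data): 3 silos, one SNP plus an intercept.
rng = np.random.default_rng(0)
silos = []
for n in (400, 600, 500):
    genotype = rng.binomial(2, 0.3, size=n).astype(float)
    X = np.column_stack([np.ones(n), genotype])
    prob = 1.0 / (1.0 + np.exp(-(-1.0 + 0.5 * genotype)))
    y = rng.binomial(1, prob).astype(float)
    silos.append((X, y))

beta_fed = federated_logistic(silos)

# Reference fit on the pooled data, which the decentralized run never needs.
X_all = np.vstack([X for X, _ in silos])
y_all = np.concatenate([y for _, y in silos])
beta_pooled = federated_logistic([(X_all, y_all)])

print(beta_fed, beta_pooled)
```

Because the pooled gradient and Hessian are sums of the per-silo ones, the decentralized iterates match a pooled fit up to floating-point rounding. This is one way to see why such a scheme need not rely on large-sample approximations within each silo, unlike combining per-silo effect estimates in a meta-analysis.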

Armin Pourshafeie | Carlos Bustamante | Snehit Prabhu
