Electronic health records: the next wave of complex disease genetics.

The combination of electronic health records (EHRs) with genetic data has ushered in the next wave of complex disease genetics. Population-based biobanks and other large cohorts provide sufficient sample sizes to identify novel genetic associations across the hundreds to thousands of phenotypes gleaned from EHRs. In this review, we summarize the current state of these EHR-linked biobanks, explore ongoing methods development in the field and highlight recent discoveries of genetic associations. We enumerate the many existing biobanks with EHRs linked to genetic data, many of which are available to researchers via application and contain sample sizes >50 000. We also discuss the computational and statistical considerations for analysis of such large datasets including mixed models, phenotype curation and cloud computing. Finally, we demonstrate how genome-wide association studies and phenome-wide association studies have identified novel genetic findings for complex diseases, specifically cardiometabolic traits. As more researchers employ innovative hypotheses and analysis approaches to study EHR-linked biobanks, we anticipate a richer understanding of the genetic etiology of complex diseases.
