Genome- and Phenome-Wide Analysis of Cardiac Conduction Identifies Markers of Arrhythmia Risk

Running title: Ritchie et al.; QRS GWAS and PheWAS in electronic records

Downloaded from http://ciajournals.org/ by guest on April 7, 2017. DOI: 10.1161/CIRCULATIONAHA.112.000604

ers at Northwestern and Marshfield reviewed randomly selected subsets of 100 subjects at Marshfield and 45 subjects at Northwestern to determine the algorithm’s accuracy at external sites. Northwestern’s evaluation also included an independent review by a board-certified internal medicine physician, with discrepancies resolved by consensus.

This study included only subjects designated as “non-Hispanic white” European American in the EMR at each site. We have previously shown that EMR-recorded ancestry performs similarly to self-report. This study was approved by each site’s Institutional Review Board. Because BioVU is de-identified and accrues individuals through leftover blood remaining after routine clinical testing, it operates as non-human subjects research according to the provisions of 45 CFR 46, as described previously. Individuals at other eMERGE sites were consented as part of each site’s DNA biobank.

Genotyping and data analysis

Genotyping was performed at the Center for Genotyping and Analysis at the Broad Institute and the Center for Inherited Disease Research (CIDR) at Johns Hopkins University. Samples of European or unknown ancestry were analyzed using the Illumina Human660WQuadv1_A genotyping platform, consisting of 561,490 SNPs and 95,876 intensity-only probes. Data were cleaned using the quality control (QC) pipeline developed by the eMERGE Genomics Working Group. This process includes evaluation of sample and marker call rate, gender mismatch and anomalies, duplicate and HapMap concordance, batch effects, Hardy-Weinberg equilibrium (HWE), sample relatedness, and population stratification.
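As a rough illustration of the per-marker checks named in the QC description, the sketch below computes call rate, minor allele frequency, and an HWE p-value for a single biallelic SNP from genotype counts. It is not the eMERGE QC pipeline: the function name `snp_qc_metrics`, the example counts, and the choice of a one-degree-of-freedom chi-square test for HWE are all assumptions made for illustration.

```python
import math

def snp_qc_metrics(n_aa, n_ab, n_bb, n_missing):
    """Return (call_rate, maf, hwe_p) for one biallelic SNP.

    Inputs are genotype counts (AA, AB, BB) plus the number of missing calls.
    The HWE p-value comes from a 1-df chi-square goodness-of-fit test, which
    is an illustrative choice and not necessarily the test used in the study.
    """
    n_called = n_aa + n_ab + n_bb
    call_rate = n_called / (n_called + n_missing)

    # Frequency of allele A among called genotypes.
    p = (2 * n_aa + n_ab) / (2 * n_called)
    q = 1.0 - p
    maf = min(p, q)

    # Expected genotype counts under Hardy-Weinberg equilibrium.
    expected = (p * p * n_called, 2 * p * q * n_called, q * q * n_called)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)

    # Chi-square survival function with 1 df: P(X > x) = erfc(sqrt(x / 2)).
    hwe_p = math.erfc(math.sqrt(chi2 / 2.0))
    return call_rate, maf, hwe_p

# Synthetic counts in exact HWE proportions (p = 0.6), no missing calls.
call_rate, maf, hwe_p = snp_qc_metrics(n_aa=360, n_ab=480, n_bb=160, n_missing=0)
```

A marker would then be kept or dropped by comparing these three values against whatever thresholds the analysis specifies.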
After QC, 528,508 SNPs were used for analysis based on the following QC criteria: SNP call rate >99%, sample call rate >99%, minor allele frequency > 0.0001, unrelated samples only (removing all parent-offspring pairs and full and half siblings), and individuals of European descent only (based on STRUCTURE analysis of >90% probability of being in the CEU cluster). Each eMERGE site used the QC pipeline to clean its initial dataset prior to merging all the samples. QC procedures were then performed on the merged eMERGE dataset, in which data from all five sites were combined, and no significant differences across sites or genotyping centers were identified. In addition, all sites had comparable QC results, including similar SNP and sample call rates, overall HWE p-values, and minor allele frequencies. The detailed QC report on the merged dataset will be deposited in dbGaP along with the merged dataset.

Single-locus tests of association were performed using linear regression assuming an additive genetic model for all 528,508 SNPs in a total of 5,272 individuals with a normal QRS duration. Our studies of ECG intervals in 32,949 normal individuals identified sex as a major modulator of normal QRS duration, with minor effects of age and ancestry. All analyses were performed unadjusted and then adjusted for age, sex, BMI, and the first principal component from Eigenstrat to account for potential population stratification; adjustment did not significantly change the key results. Because only sex is significantly associated with QRS duration in the literature, we report the sex-adjusted results here. Analyses were also performed adjusting for height and/or BMI, but these did not change the results.
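The single-locus test described above can be sketched as an ordinary least-squares fit of QRS duration on allele dosage (0, 1, or 2 copies, the additive model) with sex as a covariate. The code below is a minimal illustration under stated assumptions, not the software used in the study: the function names are hypothetical, the data are synthetic, the sex adjustment is done by residualizing both variables (which gives the same coefficient as a joint fit), and the p-value uses a normal approximation that is only reasonable for large samples.

```python
import math

def _residualize(values, covariate):
    """Remove the linear effect of one covariate (plus intercept) from values."""
    n = len(values)
    mean_v = sum(values) / n
    mean_c = sum(covariate) / n
    sxx = sum((c - mean_c) ** 2 for c in covariate)
    sxy = sum((c - mean_c) * (v - mean_v) for c, v in zip(covariate, values))
    slope = sxy / sxx if sxx > 0 else 0.0
    return [(v - mean_v) - slope * (c - mean_c) for c, v in zip(covariate, values)]

def additive_snp_test(qrs_ms, dosage, sex):
    """Per-allele effect of one SNP on QRS duration, adjusted for sex.

    Returns (beta, standard_error, p_value). beta is in msec per copy of the
    coded allele; the p-value is a two-sided normal approximation.
    """
    y = _residualize(qrs_ms, sex)
    g = _residualize(dosage, sex)
    sgg = sum(v * v for v in g)
    beta = sum(a * b for a, b in zip(g, y)) / sgg
    rss = sum((b - beta * a) ** 2 for a, b in zip(g, y))
    dof = len(y) - 3  # intercept, sex, and SNP
    se = math.sqrt(rss / dof / sgg)
    p_value = math.erfc(abs(beta / se) / math.sqrt(2.0))
    return beta, se, p_value

# Synthetic, balanced toy data (not study data): true per-allele effect 1.0 msec.
qrs_ms, dosage, sex = [], [], []
for s in (0, 1):
    for d in (0, 1, 2):
        for noise in (-0.5, 0.5):
            dosage.append(d)
            sex.append(s)
            qrs_ms.append(85.0 + 1.0 * d + 4.0 * s + noise)

beta, se, p_value = additive_snp_test(qrs_ms, dosage, sex)
```

Repeating this fit for every SNP, with additional covariates where required, yields the genome-wide set of per-SNP effect estimates and p-values.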
Associations with the lowest P values (p<10) were then submitted to the recent Cohorts for Heart and Aging Research in Genomic Epidemiology (CHARGE) QRS meta-analysis group, and a table of P-values from that analysis was generated. The CHARGE meta-analysis of QRS duration has been described in detail previously; briefly, it involved 40,407 individuals of European ancestry drawn from 15 sites. Individuals with prior myocardial infarction, heart failure, arrhythmias, pacemakers, antiarrhythmic medication use, or QRS duration >120 ms were excluded.

To calculate the variance explained by all SNPs in the dataset, we analyzed the data using GCTA. Only SNPs with minor allele frequency > 0.01, genotyping efficiency > 99.9%, and HWE P > 0.001 were included in the analysis (n=505,502 SNPs). The genetic relationship matrix (GRM) was computed for all 5,272 subjects and all SNPs using GCTA. To eliminate possible cryptic relationships, subjects with GRM values >0.25 were pruned from the analysis, which removed 310 subjects. The proportion of variance in QRS duration explained by either all SNPs or all SNPs excluding the subset of 23 SNPs significant in the CHARGE GWAS was computed on the remaining subjects. We compared this to a linear regression analysis using the five SNPs in Table 2 to estimate the proportion of variance explained by these loci. All analyses were adjusted for age, gender, and the first principal component (previously computed).

Phenome-wide association study of QRS-associated SNPs

We selected the most significant SNP associations for analysis by PheWAS. For this analysis, we combined the entire eMERGE cohort of European American individuals (n=13,859) identified across the five eMERGE sites. These individuals represent a superset of the 5,272 individuals with normal ECGs and without heart disease used for the GWAS.
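The GRM-based pruning step in the variance-explained analysis can be illustrated with a small sketch. The off-diagonal entries below follow the standardized-genotype estimator described for GCTA, where the relationship between two subjects is the average over SNPs of (x_a - 2p)(x_b - 2p) / (2p(1 - p)). Everything else is an assumption for illustration: the function names are hypothetical, the diagonal is computed with the same formula for simplicity, the greedy removal rule is not necessarily how GCTA selects subjects to drop, and the genotypes are synthetic.

```python
def genetic_relationship_matrix(genotypes):
    """genotypes[i][j] is the coded-allele count (0, 1, or 2) for subject i at SNP j."""
    n_subjects = len(genotypes)
    n_snps = len(genotypes[0])
    freqs = [sum(row[j] for row in genotypes) / (2 * n_subjects) for j in range(n_snps)]
    # Monomorphic SNPs carry no information and would divide by zero.
    usable = [j for j, p in enumerate(freqs) if 0.0 < p < 1.0]

    grm = [[0.0] * n_subjects for _ in range(n_subjects)]
    for a in range(n_subjects):
        for b in range(a, n_subjects):
            total = 0.0
            for j in usable:
                p = freqs[j]
                total += ((genotypes[a][j] - 2 * p) * (genotypes[b][j] - 2 * p)
                          / (2 * p * (1 - p)))
            grm[a][b] = grm[b][a] = total / len(usable)
    return grm

def prune_related(grm, cutoff=0.25):
    """Greedily drop the second member of each pair whose relationship exceeds cutoff."""
    n = len(grm)
    removed = set()
    for a in range(n):
        if a in removed:
            continue
        for b in range(a + 1, n):
            if b not in removed and grm[a][b] > cutoff:
                removed.add(b)
    return [i for i in range(n) if i not in removed]

# Synthetic genotypes: subjects 0 and 1 are duplicates, so one should be pruned.
toy = [
    [0, 0, 2, 2],
    [0, 0, 2, 2],
    [2, 2, 0, 0],
    [0, 2, 0, 2],
]
grm = genetic_relationship_matrix(toy)
kept = prune_related(grm, cutoff=0.25)
```

With only four SNPs the estimates are extremely noisy; the toy data serve only to show the mechanics of the cutoff.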
To define diseases, we queried all International Classification of Diseases, 9th edition (ICD-9) codes from the respective EMRs of the five eMERGE sites. The PheWAS software uses occurrences of ICD codes to classify each person as having one or more of 778 possible clinical phenotypes (typically diseases). For each disease, the PheWAS algorithm constructs a control population by selecting all patients who do not have the case disease or closely related diseases (e.g., a patient with a bundle branch block cannot serve as a control for complete heart block). The PheWAS methodology has previously been validated through rediscovery of known associations. Each phenotype is then analyzed by comparing its case and control groups for each tested SNP (n=23). We have observed that positive predictive values increase when individual codes are present more than once in the EMR, and here we required each case to have at least four instances of the same ICD code in a PheWAS case group. In addition, we did not analyze phenotypes occurring in fewer than 50 patients (a prevalence of 0.36% in the dataset).

After PheWAS case and control groups were identified using the PheWAS software, association analyses were performed with PLINK using logistic regression adjusted for age, gender, and the first three principal components as calculated by Eigenstrat, because the third principal component was statistically significant in this larger population. Analyses with and without principal component adjustment did not differ substantively.
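The case and control rules described above can be sketched as a small classification function. This is a hedged illustration and not the PheWAS software: the function names are hypothetical, the two ICD-9 codes stand in for a real phenotype grouping, and treating patients with one to three instances of a case code as neither case nor control is an assumption the text does not spell out.

```python
from collections import Counter

def assign_phewas_status(patient_codes, phenotype_codes, related_codes, min_instances=4):
    """Classify one patient for one phenotype.

    patient_codes   : list of ICD-9 codes, one entry per recorded instance
    phenotype_codes : set of codes that define the case phenotype
    related_codes   : set of codes for closely related phenotypes
    Returns "case", "control", or None (excluded from this phenotype).
    """
    counts = Counter(patient_codes)
    # Case: at least min_instances occurrences of the same qualifying code.
    if any(counts[code] >= min_instances for code in phenotype_codes):
        return "case"
    # Any trace of the phenotype or a related phenotype bars control status.
    if any(counts[code] > 0 for code in phenotype_codes | related_codes):
        return None
    return "control"

def phenotype_is_analyzed(n_cases, min_cases=50):
    """Phenotypes with fewer than min_cases cases are skipped."""
    return n_cases >= min_cases

# Illustrative grouping only: complete heart block as the phenotype,
# bundle branch block as a closely related exclusion.
heart_block = {"426.0"}
related = {"426.4"}

status_case = assign_phewas_status(["426.0"] * 4, heart_block, related)
status_excluded = assign_phewas_status(["426.4"], heart_block, related)
status_control = assign_phewas_status(["401.9", "250.00"], heart_block, related)
```

The resulting case and control labels would then feed the per-SNP logistic regression.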
Survival analysis of QRS population

Following PheWAS analysis, we tested whether the SCN5A rs1805126 and SCN10A rs6795970 SNPs were associated with subsequent development of atrial fibrillation and cardiac arrhythmias in the original set of 5,272 patients who met our algorithm definition of normal cardiac conduction/normal heart. Phenotype definitions were drawn from the PheWAS analysis using billing codes. Kaplan-Meier and Cox proportional hazards time-to-event analyses were performed, with the initial normal ECG as the starting time. Cox proportional hazards models were adjusted for age, sex, principal components as calculated above, and QRS duration.

Results

Population identification

We identified 5,272 Caucasian patients (2,488 males and 2,784 females; Table 1) across the five eMERGE-I sites. The positive predictive value (PPV) of the automated phenotype algorithm for identifying subjects with normal ECGs and without exclusions was 97% (95% confidence interval [CI] 91-99%) at the development site, Vanderbilt. The PPVs at Northwestern University and Marshfield Clinic were 97% (95% CI 83%-100%) and 100% (95% CI 96%-100%), respectively. Combining all reviewed samples across the three sites, the PPV was 98% (95% CI 96%-100%). The mean QRS duration was 87.9 msec (standard deviation 9.5 msec; median 88.0 msec; Figure 1A).

GWAS results

A total of 528,508 SNPs passed quality control of eMERGE-supported Illumina 660Quad genotyping data in these subjects. Figure 1B shows the genome-wide association analysis for QRS duration adjusted for sex; the findings were near-identical for the unadjusted analysis. A single association survived Bonferroni correction: a SNP (rs1805126) in SCN5A, the gene encoding the cardiac sodium channel, was associated with QRS duration (beta=1.002 msec per copy of the T allele, p=1.45 x 10).
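The PPV confidence intervals reported under Population identification can be approximated with an exact binomial (Clopper-Pearson) interval. The text does not state which interval method was used, so the method is an assumption, and the function names are hypothetical. The sketch finds each bound by bisection on the binomial tail probability.

```python
import math

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k + 1))

def _bisect(func, target, increasing):
    """Find p in [0, 1] with func(p) == target for a monotone func."""
    lo, hi = 0.0, 1.0
    for _ in range(100):
        mid = (lo + hi) / 2
        if (func(mid) < target) == increasing:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def exact_binomial_ci(k, n, alpha=0.05):
    """Clopper-Pearson interval for k successes out of n trials."""
    if k == 0:
        lower = 0.0
    else:
        # Lower bound: P(X >= k | p) = alpha / 2, increasing in p.
        lower = _bisect(lambda p: 1.0 - binom_cdf(k - 1, n, p), alpha / 2, True)
    if k == n:
        upper = 1.0
    else:
        # Upper bound: P(X <= k | p) = alpha / 2, decreasing in p.
        upper = _bisect(lambda p: binom_cdf(k, n, p), alpha / 2, False)
    return lower, upper

# 100 reviewed subjects, all judged correct (a PPV of 100%).
lower, upper = exact_binomial_ci(100, 100)
```

For 100 correct classifications out of 100, the lower bound is about 0.964, which agrees with the 96%-100% interval reported for Marshfield if an exact method was used.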
The set taken forward to the CHARGE QRS meta-analysis consortium included 108 SNPs with P-values <10. The retrieved P-values for this set fell into two distinct groups: 23 SNPs with P-values in the CHARGE set from 10 to 10, and 85 with P-values > 0.003. These 23 associations (Supplementary Table 1) are located in the five loci with the lowest P values reported by the CHARGE consortium: 18/23 are in the chromosome 3 locus that includes SCN5A and SCN10A, as well as other genes (e.g., EXOG and XYLB1). The other three loci are near SLC35F1 and C6orf204 (chromosome 6), near CDKN1A (chromosome 6), and in NFIA (chromosome 1). The most significant SNP for each locus is presented in Table 2. The locus zoom plot (Supplementary Figure 1) shows little linkage disequilibrium
