CLARITE Facilitates the Quality Control and Analysis Process for EWAS of Metabolic-Related Traits

While genome-wide association studies are an established method for identifying genetic variants associated with disease, environment-wide association studies (EWAS) highlight the contribution of nongenetic components to complex phenotypes. However, the lack of high-throughput quality control (QC) pipelines for EWAS data encourages analysis plans in which the data are either cleaned after a first-pass analysis, which can introduce bias, or cleaned manually, which is arduous and susceptible to user error. We present CLeaning to Analysis: Reproducibility-based Interface for Traits and Exposures (CLARITE), a new software tool that uses user-guided automation to clean environmental data, perform regression analysis, and visualize results on a single platform. It is available as both an R package and a Python package. Though CLARITE focuses on EWAS, it is also intended to improve the QC process for phenotypes and clinical lab measures in a variety of downstream analyses, including phenome-wide association studies and gene-environment interaction studies. To demonstrate the utility of CLARITE, we performed a novel EWAS of body mass index (BMI) in the National Health and Nutrition Examination Survey (NHANES) (overall N: Discovery = 9063, Replication = 9874), testing over 300 environmental variables that remained after QC and adjusting for sex, age, race, socioeconomic status, and survey year. The analysis used survey weights along with cluster and strata information to account for the complex survey design. Sixteen BMI results replicated at a Bonferroni-corrected p < 0.05. The top replicating results were serum levels of γ-tocopherol (vitamin E) (Discovery Bonferroni p: 8.67 × 10⁻¹², Replication Bonferroni p: 2.70 × 10⁻⁹) and iron (Discovery Bonferroni p: 1.09 × 10⁻⁸, Replication Bonferroni p: 1.73 × 10⁻¹⁰). The results of this EWAS are important to consider in the analysis of metabolic traits, because BMI is tightly associated with these phenotypes. As such, exposures predictive of BMI may be useful for covariate and/or interaction assessment of metabolic-related traits. By establishing a high-throughput QC infrastructure, CLARITE improves data quality for EWAS, gene-environment interaction studies, and phenome-wide association studies. We therefore recommend CLARITE for studying the environmental factors underlying complex disease.
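
The replication criterion described above (a Bonferroni-corrected p < 0.05 in both the Discovery and Replication sets) can be illustrated with a minimal Python sketch. It does not use the CLARITE API; the function names, variable names, and p-values are hypothetical, and it assumes the correction is applied within each dataset across all variables tested there.

```python
def bonferroni(pvalues):
    """Multiply each raw p-value by the number of tests, capping at 1.0."""
    m = len(pvalues)
    return {name: min(1.0, p * m) for name, p in pvalues.items()}


def replicated(discovery, replication, alpha=0.05):
    """Return variables whose corrected p is below alpha in BOTH datasets.

    Assumption: each dataset is corrected separately over the variables
    tested in it; results are sorted by corrected Discovery p-value.
    """
    disc_adj = bonferroni(discovery)
    rep_adj = bonferroni(replication)
    hits = [
        name
        for name in disc_adj
        if name in rep_adj and disc_adj[name] < alpha and rep_adj[name] < alpha
    ]
    return sorted(hits, key=lambda name: disc_adj[name])


# Hypothetical raw p-values for four exposures (not results from the paper).
discovery = {"exposure_a": 1e-6, "exposure_b": 0.01, "exposure_c": 0.5, "exposure_d": 0.2}
replication = {"exposure_a": 1e-5, "exposure_b": 0.03, "exposure_c": 0.4, "exposure_d": 0.001}

print(replicated(discovery, replication))
```

Here only `exposure_a` passes in both sets: `exposure_b` is significant after correction in Discovery but not in Replication, and `exposure_d` shows the reverse pattern.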
