Signals Among Signals: Prioritizing Nongenetic Associations in Massive Data Sets

Abstract

Massive data sets are often regarded as a panacea for the underpowered studies of the past. At the same time, it is becoming clear that in many of these data sets, in which thousands of variables are measured across hundreds of thousands or millions of individuals, almost any desired relationship can be inferred with a suitable combination of covariates or analytic choices. Inspired by the genome-wide association study (GWAS) analysis paradigm that has transformed human genetics, X-wide association studies, or “XWAS,” have emerged as a popular approach to systematically analyzing nongenetic data sets and guarding against false positives. However, these studies often yield hundreds or thousands of associations characterized by modest effect sizes and minuscule P values. Many of these associations will be spurious, arising from confounding and other biases. One way of characterizing confounding in the genomics paradigm is the genomic inflation factor. An analogous “X-wide inflation factor,” denoted λX, can be defined and applied to published XWAS. Effects that arise in XWAS may be prioritized using replication, triangulation, quantification of measurement error, contextualization of each effect in the distribution of all effect sizes within a field, and pre-registration. Criteria like those of Bradford Hill need to be reconsidered in light of exposure-wide epidemiology to prioritize signals among signals.
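
As an illustration of the inflation-factor idea, the sketch below computes an inflation factor from a list of P values. It assumes, by analogy with the usual genomic-control definition, that λX is the median of the observed 1-degree-of-freedom chi-square statistics divided by the median expected under the null (about 0.455). The abstract does not give the formula, so this definition, the function name `inflation_factor`, and the example inputs are assumptions for illustration only, not the authors' implementation.

```python
from statistics import NormalDist, median

_STD_NORMAL = NormalDist()

# Median of a chi-square distribution with 1 degree of freedom under the null:
# the square of the 75th percentile of the standard normal (about 0.455).
NULL_MEDIAN_CHI2 = _STD_NORMAL.inv_cdf(0.75) ** 2


def inflation_factor(p_values):
    """Return median(observed chi-square) / median(null chi-square).

    Assumes two-sided P values from 1-df tests. Values near 1 suggest little
    systematic inflation; values well above 1 suggest that confounding or
    other biases may be inflating the test statistics.
    """
    if not p_values:
        raise ValueError("p_values must not be empty")
    chi2_stats = []
    for p in p_values:
        if not 0.0 < p <= 1.0:
            raise ValueError(f"P value out of range (0, 1]: {p!r}")
        # Convert a two-sided P value to a z-score via the lower tail,
        # then square it to obtain the 1-df chi-square statistic.
        z = _STD_NORMAL.inv_cdf(p / 2.0)
        chi2_stats.append(z * z)
    return median(chi2_stats) / NULL_MEDIAN_CHI2


if __name__ == "__main__":
    # Hypothetical inputs: an evenly spaced grid standing in for null P values,
    # and a shifted version standing in for an inflated study.
    null_like = [(i + 0.5) / 1001 for i in range(1001)]
    inflated = [p ** 2 for p in null_like]
    print("null-like:", round(inflation_factor(null_like), 3))
    print("inflated: ", round(inflation_factor(inflated), 3))
```

Using the median rather than the mean keeps the estimate robust to a small number of very strong associations, which is the usual motivation for this choice in the genomic setting.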
