Missing data interpolation in integrative multi-cohort analysis with disparate covariate information

Integrative analysis of datasets generated by multiple cohorts is a widely used approach for increasing sample size, precision of population estimators, and generalizability of analysis results in epidemiological studies. However, individual cohort datasets often lack some of the variables of interest for an integrative analysis, because those variables were not collected as part of the original study. Such cohort-level missingness poses methodological challenges to the integrative analysis, since missing variables have traditionally been either (1) removed from the data for complete case analysis, or (2) completed by missing data interpolation techniques using data with the same covariate distribution from other studies. In most integrative-analysis studies, neither approach is optimal: the first loses the majority of study covariates, and the second requires specifying which cohorts follow the same distribution, which is difficult. We propose a novel approach to identify the studies with the same distribution that could be used for completing the cohort-level missing information. Our methodology relies on (1) identifying sub-groups of cohorts with similar covariate distributions using cohort identity random forest prediction models followed by clustering; and then (2) applying a recursive pairwise distribution test for high dimensional data to these sub-groups. Extensive simulation studies show that cohorts with the same distribution are correctly grouped together in almost all simulation settings. Application of our methods to two ECHO-wide Cohort Studies reveals that the cohorts grouped together reflect the similarities in study design. The methods are implemented in the R software package relate.
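The intuition behind step (1) can be illustrated with a minimal, dependency-free Python sketch. It is not the implementation in the relate package. All names, the toy data, and the threshold are hypothetical. A leave-one-out nearest-neighbour classifier stands in for the random forest, and connected components on a thresholded confusion rate stand in for the clustering step. The idea is that when a classifier cannot tell two cohorts apart, their covariate distributions are likely similar, so they are candidates for the same sub-group. Step (2), the recursive pairwise distribution test, is not sketched here.

```python
import random
from collections import defaultdict


def simulate_cohorts(seed=0, n=50):
    """Hypothetical toy data: cohorts A and B share one distribution, C and D another."""
    rng = random.Random(seed)
    centers = {"A": 0.0, "B": 0.0, "C": 5.0, "D": 5.0}
    return [
        (name, (rng.gauss(mu, 1.0), rng.gauss(mu, 1.0)))
        for name, mu in centers.items()
        for _ in range(n)
    ]


def predict_cohort_loo(samples):
    """Predict each sample's cohort from its nearest other sample.

    Assumption: this 1-nearest-neighbour rule is only a simple stand-in
    for the cohort identity random forest described in the abstract.
    """
    preds = []
    for i, (_, x) in enumerate(samples):
        best_label, best_dist = None, float("inf")
        for j, (label, y) in enumerate(samples):
            if i == j:
                continue  # leave the sample itself out
            dist = (x[0] - y[0]) ** 2 + (x[1] - y[1]) ** 2
            if dist < best_dist:
                best_label, best_dist = label, dist
        preds.append(best_label)
    return preds


def confusion_rates(samples, preds):
    """Row-normalised confusion: rates[true][pred] = share of `true` predicted as `pred`."""
    counts = defaultdict(lambda: defaultdict(int))
    totals = defaultdict(int)
    for (true, _), pred in zip(samples, preds):
        counts[true][pred] += 1
        totals[true] += 1
    return {t: {p: c / totals[t] for p, c in row.items()} for t, row in counts.items()}


def group_cohorts(rates, threshold=0.25):
    """Merge cohorts whose symmetrised confusion rate exceeds a (hypothetical) threshold."""
    names = sorted(rates)
    parent = {name: name for name in names}

    def find(name):
        while parent[name] != name:
            name = parent[name]
        return name

    for a in names:
        for b in names:
            if a < b:
                sym = (rates[a].get(b, 0.0) + rates[b].get(a, 0.0)) / 2
                if sym > threshold:
                    parent[find(a)] = find(b)

    groups = defaultdict(list)
    for name in names:
        groups[find(name)].append(name)
    return sorted(groups.values())


samples = simulate_cohorts(seed=0)
rates = confusion_rates(samples, predict_cohort_loo(samples))
print(group_cohorts(rates))
```

In this toy setting, cohorts drawn from the same distribution are confused with each other roughly half the time, while well-separated cohorts are almost never confused, so the thresholded confusion rate separates the two sub-groups.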
