Exploring the impact of selection bias in observational studies of COVID-19: a simulation study

Abstract

Background: Non-random selection of analytic subsamples could introduce selection bias in observational studies. We explored the potential presence and impact of selection in studies of SARS-CoV-2 infection and COVID-19 prognosis.

Methods: We tested the association of a broad range of characteristics with selection into COVID-19 analytic subsamples in the Avon Longitudinal Study of Parents and Children (ALSPAC) and UK Biobank (UKB). We then conducted empirical analyses and simulations to explore the potential presence, direction and magnitude of bias due to this selection (relative to our defined UK-based adult target populations) when estimating the association of body mass index (BMI) with SARS-CoV-2 infection and death-with-COVID-19.

Results: In both cohorts, a broad range of characteristics was related to selection, sometimes in opposite directions (e.g. more-educated people were more likely to have data on SARS-CoV-2 infection in ALSPAC, but less likely in UKB). Higher BMI was associated with higher odds of SARS-CoV-2 infection and death-with-COVID-19. We found non-negligible bias in many simulated scenarios.

Conclusions: Analyses using COVID-19 self-reported or national registry data may be biased due to selection. The magnitude and direction of this bias depend on the outcome definition, the true effect of the risk factor and the assumed selection mechanism; these are likely to differ between studies with different target populations. Bias due to sample selection is a key concern in COVID-19 research based on national registry data, especially as countries end free mass testing. The framework we have used can be applied by other researchers assessing the extent to which their results may be biased for their research question of interest.
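The kind of bias described above can be illustrated with a small simulation. The sketch below is not the authors' simulation: it assumes a hypothetical binary exposure (`high_bmi`), a binary outcome (`infected`) and made-up selection probabilities that depend on both. Under these assumptions the exposure has no effect on the outcome in the target population, yet the odds ratio estimated in the selected subsample moves away from 1.

```python
import numpy as np


def odds_ratio(exposure, outcome):
    """Odds ratio from the 2x2 table of two binary arrays."""
    a = np.sum((exposure == 1) & (outcome == 1))
    b = np.sum((exposure == 1) & (outcome == 0))
    c = np.sum((exposure == 0) & (outcome == 1))
    d = np.sum((exposure == 0) & (outcome == 0))
    return (a * d) / (b * c)


def simulate(n=200_000, true_log_or=0.0, seed=0):
    """Return (odds ratio in full population, odds ratio in selected subsample).

    All numeric values below are illustrative assumptions, not estimates
    from ALSPAC, UK Biobank or any other study.
    """
    rng = np.random.default_rng(seed)

    # Hypothetical exposure: 30% of the target population has high BMI.
    high_bmi = rng.binomial(1, 0.3, size=n)

    # Outcome follows a logistic model; true_log_or=0 means no true effect.
    p_infected = 1.0 / (1.0 + np.exp(-(-2.0 + true_log_or * high_bmi)))
    infected = rng.binomial(1, p_infected)

    # Assumed selection probabilities, indexed as [high_bmi, infected].
    # Selection depends on both exposure and outcome, and not in a purely
    # multiplicative way, which is what distorts the odds ratio.
    selection_prob = np.array([[0.05, 0.40],
                               [0.15, 0.50]])[high_bmi, infected]
    selected = rng.random(n) < selection_prob

    full_or = odds_ratio(high_bmi, infected)
    selected_or = odds_ratio(high_bmi[selected], infected[selected])
    return full_or, selected_or


if __name__ == "__main__":
    full_or, selected_or = simulate()
    print(f"Odds ratio, target population:  {full_or:.2f}")
    print(f"Odds ratio, selected subsample: {selected_or:.2f}")
```

With these assumed selection probabilities, the expected distortion factor is (0.50 × 0.05) / (0.15 × 0.40) ≈ 0.42, so a null association in the target population appears protective in the selected subsample. Changing the selection table or `true_log_or` changes both the size and the direction of the bias, in line with the abstract's point that the bias depends on the true effect and the assumed selection mechanism.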
