A Framework for Understanding Selection Bias in Real-World Healthcare Data

The use of administrative patient-care data, such as electronic health records (EHR) and medical/pharmaceutical claims, for population-based scientific research has become increasingly common. Because the vast sample sizes lead to very small standard errors, researchers need to pay closer attention to potential biases in estimates of the association parameters of interest, specifically to biases that do not diminish as the sample size increases. Among these multiple sources of bias, this paper focuses on selection bias. We present an analytic framework that uses directed acyclic graphs to help applied researchers dissect how different sources of selection bias may affect estimates of the association between a binary outcome and an exposure (continuous or categorical) of interest. We consider four easy-to-implement weighting approaches for reducing selection bias, with accompanying variance formulae. Through a simulation study, we demonstrate when these approaches can correct for selection bias in practical analyses of real-world data. We compare the methods in a data example in which the goal is to estimate the well-known association between cancer and biological sex, using EHR data from a longitudinal biorepository in the University of Michigan healthcare system. We provide annotated R code to implement these weighted methods with associated inference.
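To make the idea of weighting against selection bias concrete, the following minimal sketch works through a toy case with a binary outcome D and, for simplicity, a binary exposure Z. It is not one of the four approaches considered in the paper. All numbers, the variable names, and the assumption that the selection probabilities P(S = 1 | D, Z) are known exactly are hypothetical; in practice these probabilities must be estimated. The sketch uses expected cell counts rather than random draws, so it shows a bias that does not shrink with sample size and how inverse probability weighting removes it in this idealised setting.

```python
from itertools import product


def odds_ratio(cells):
    """Odds ratio between D and Z from a dict {(d, z): count or weight}."""
    return (cells[(1, 1)] * cells[(0, 0)]) / (cells[(1, 0)] * cells[(0, 1)])


# Hypothetical target population (illustrative numbers, not from the paper).
N = 100_000
p_z = 0.5                          # P(Z = 1)
p_d_given_z = {0: 0.10, 1: 0.20}   # P(D = 1 | Z = z)

# Hypothetical, assumed-known selection probabilities P(S = 1 | D = d, Z = z).
# Selection depends jointly on outcome and exposure, which biases the odds ratio.
p_select = {(0, 0): 0.05, (0, 1): 0.05, (1, 0): 0.10, (1, 1): 0.40}

# Expected cell counts in the target population.
population = {}
for d, z in product((0, 1), repeat=2):
    pz = p_z if z == 1 else 1 - p_z
    pd = p_d_given_z[z] if d == 1 else 1 - p_d_given_z[z]
    population[(d, z)] = N * pz * pd

# Expected cell counts in the selected (observed) sample.
sample = {cell: population[cell] * p_select[cell] for cell in population}

# Inverse probability weighting: each selected person counts 1 / P(S = 1 | D, Z).
weighted = {cell: sample[cell] / p_select[cell] for cell in sample}

true_or = odds_ratio(population)   # association in the target population
naive_or = odds_ratio(sample)      # unweighted analysis of the selected sample
ipw_or = odds_ratio(weighted)      # weighted analysis of the selected sample

print(f"true OR = {true_or:.2f}, naive OR = {naive_or:.2f}, IPW OR = {ipw_or:.2f}")
```

With these illustrative inputs the unweighted odds ratio is four times the population value, while the weighted one matches it. The exact recovery relies on the selection probabilities being known and correctly specified, which is the assumption that estimated weights can only approximate.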
