Population‐calibrated multiple imputation for a binary/categorical covariate in categorical regression models

Multiple imputation (MI) has become popular for analyses with missing data in medical research. The standard implementation of MI is based on the assumption that data are missing at random (MAR). However, when data are missing not at random (MNAR), MI performed under the MAR assumption can give biased inference. For an incomplete variable in a given data set, its corresponding population marginal distribution might also be available in an external data source. We show how this information can be readily utilised in the imputation model to calibrate inference to the population by incorporating an appropriately calculated offset termed the “calibrated‐δ adjustment.” We describe the derivation of this offset from the population distribution of the incomplete variable and show how, in applications, it can be used to closely (and often exactly) match the post‐imputation distribution to the population level. Through analytic and simulation studies, we show that our proposed calibrated‐δ adjustment MI method can give the same inference as standard MI when data are MAR, and can produce more accurate inference under two general MNAR mechanisms. The method is used to impute missing ethnicity data in a type 2 diabetes prevalence case study using UK primary care electronic health records, where it results in scientifically relevant changes in inference for non‐White ethnic groups compared with standard MI. Calibrated‐δ adjustment MI represents a pragmatic approach for utilising available population‐level information in a sensitivity analysis to explore potential departures from the MAR assumption.
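
As an illustration of the idea of a calibrating offset, the following is a minimal sketch for a binary incomplete covariate. It is not the authors' derivation: it assumes the linear predictors of a logistic imputation model have already been obtained for the records with missing values, and it finds, by bisection, an offset δ such that the expected post-imputation proportion equals a known population proportion. The function names (`expit`, `calibrate_delta`) and all parameter names are hypothetical.

```python
import math


def expit(z):
    """Numerically stable inverse logit."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def calibrate_delta(lp_missing, n_obs_ones, n_obs, p_pop,
                    lo=-20.0, hi=20.0, tol=1e-10):
    """Find an offset delta (hypothetical helper, illustrative only).

    lp_missing : linear predictors from a logistic imputation model,
                 one per record with a missing covariate value
    n_obs_ones : number of observed records with covariate value 1
    n_obs      : number of records with the covariate observed
    p_pop      : population proportion with covariate value 1
    """
    n_mis = len(lp_missing)
    # Proportion of 1s the imputed values must have, on average, so that
    # observed and imputed data together match the population proportion.
    target = (p_pop * (n_obs + n_mis) - n_obs_ones) / n_mis
    if not 0.0 < target < 1.0:
        raise ValueError("population proportion cannot be matched")

    def gap(delta):
        # Expected imputed proportion minus target; increasing in delta.
        return sum(expit(lp + delta) for lp in lp_missing) / n_mis - target

    if gap(lo) > 0 or gap(hi) < 0:
        raise ValueError("root not bracketed; widen [lo, hi]")

    # Bisection on a continuous, monotone function.
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if gap(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


# Usage: 6 observed records (2 with value 1), 4 missing, population 50%.
delta = calibrate_delta([0.0, 0.0, 0.0, 0.0], n_obs_ones=2, n_obs=6, p_pop=0.5)
```

In this sketch, imputed values would then be drawn with probability `expit(lp + delta)` rather than `expit(lp)`. Setting δ = 0 recovers the uncalibrated imputation probabilities, so the offset acts purely as a sensitivity parameter pinned down by the external population information.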
