An Overview of Population Size Estimation where Linking Registers Results in Incomplete Covariates, with an Application to Mode of Transport of Serious Road Casualties

Abstract We consider the linkage of two or more registers when the registers do not cover the whole target population, and when relevant categorical auxiliary variables are available in addition to the usual matching variable(s). Each auxiliary variable is unique to one of the registers, although different variables may be present on each register. The linked registers therefore contain incomplete information on both the observations (often individuals) and the variables. By treating this as a missing data problem, it is possible to construct a linked data set that is adjusted to estimate the part of the population missed by all registers and that contains completed covariate information for every register. This is achieved using an Expectation-Maximization (EM) algorithm. We examine the properties of this approach where the model is appropriate, in situations corresponding to real applications in official statistics, and where the model conditions are violated. The approach is applied to data on road accidents in the Netherlands, where the cause of the accident is recorded by the police and by the hospital. The cause recorded by the police is treated as missing information for the statistical units registered only by the hospital, and vice versa. The method needs to be applied more widely to give a better impression of the range of problems where it can be beneficial.
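The EM idea described in the abstract can be illustrated with a minimal sketch for two registers, A and B. The sketch is not the authors' implementation. It assumes one categorical covariate X1 recorded only in A and one covariate X2 recorded only in B, and it assumes the log-linear model [A X2][B X1][X1 X2], fitted by iterative proportional fitting with the cell "in neither register" treated as a structural zero. The function names and all counts are hypothetical.

```python
import numpy as np

def ipf(table, mask, sweeps=50):
    """Fit the assumed model [A X2][B X1][X1 X2] to a table indexed [a, b, x1, x2].

    Index 1 means "in the register", index 0 means "not in the register".
    mask is 1 in observable cells and 0 in the structural-zero cell a = b = 0.
    """
    fit = mask.astype(float)
    # Axes summed out to obtain the A-X2, B-X1 and X1-X2 margins.
    margins = [(1, 2), (0, 3), (0, 1)]
    for _ in range(sweeps):
        for axes in margins:
            target = table.sum(axis=axes, keepdims=True)
            current = fit.sum(axis=axes, keepdims=True)
            ratio = np.divide(target, current,
                              out=np.zeros_like(target), where=current > 0)
            fit = fit * ratio
    return fit

def em_two_registers(n_both, n_a_only, n_b_only, iters=200):
    """EM for two linked registers with register-specific covariates.

    n_both[x1, x2]: units in both registers (both covariates observed).
    n_a_only[x1]:   units only in A (X2 missing).
    n_b_only[x2]:   units only in B (X1 missing).
    """
    n_x1, n_x2 = n_both.shape
    mask = np.ones((2, 2, n_x1, n_x2))
    mask[0, 0] = 0.0  # units missed by both registers are never observed
    fit = mask.copy()
    for _ in range(iters):
        completed = np.zeros((2, 2, n_x1, n_x2))
        completed[1, 1] = n_both
        # E-step: spread A-only counts over X2 and B-only counts over X1,
        # in proportion to the current fitted values.
        p_x2 = fit[1, 0] / fit[1, 0].sum(axis=1, keepdims=True)
        completed[1, 0] = n_a_only[:, None] * p_x2
        p_x1 = fit[0, 1] / fit[0, 1].sum(axis=0, keepdims=True)
        completed[0, 1] = n_b_only[None, :] * p_x1
        # M-step: refit the log-linear model to the completed table.
        fit = ipf(completed, mask)
    # With no A-B interaction in the model, the odds ratio is 1 in each
    # covariate stratum, which gives the estimate for the missed cell.
    missed = fit[1, 0] * fit[0, 1] / fit[1, 1]
    return completed, fit, missed

# Hypothetical counts, for illustration only.
n_both = np.array([[40.0, 10.0], [8.0, 30.0]])
n_a_only = np.array([60.0, 45.0])
n_b_only = np.array([35.0, 50.0])

completed, fit, missed = em_two_registers(n_both, n_a_only, n_b_only)
n_observed = n_both.sum() + n_a_only.sum() + n_b_only.sum()
n_hat = n_observed + missed.sum()
print("observed:", n_observed, "estimated population size:", round(n_hat, 1))
```

The completed table reproduces the observed counts by construction, and `missed` gives an estimate of the number of units absent from both registers in each covariate combination. The sketch gives no measure of uncertainty for the estimate.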
