For a sound use of health care data in epidemiology: evaluation of a calibration model for count data with application to prediction of cancer incidence in areas without cancer registry.

There is growing interest in using health care (HC) data to produce epidemiological surveillance indicators such as incidence. In the field of cancer, incidence is typically provided by local cancer registries which, in many countries, do not cover the whole territory; proxy measures from available nationwide HC databases appear to be a suitable way to fill this gap. However, in most cases, direct counts from these databases do not provide reliable measures of incidence. To obtain accurate incidence estimates and prediction intervals, these databases need to be calibrated against a registry-based gold-standard measure of incidence. This article presents a calibration model for count data developed to predict cancer incidence from HC data in geographical areas without cancer registries. First, the ratio between the proxy measure and incidence is modeled in areas with registries using a Poisson mixed model that allows for heterogeneity between areas (calibration stage). This ratio is then inverted to predict incidence from the proxy measure in areas without registries. The prediction error admits a closed-form expression that accounts for between-area heterogeneity in the ratio. A simulation study shows the accuracy of the method in terms of prediction and coverage probability. The method is further applied to predict the incidence of two cancers in France using hospital data as the proxy measure. We hope this approach will encourage sound use of the usually imperfect information extracted from HC data.
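The two stages described above (calibrate the proxy/incidence ratio where registries exist, then invert it elsewhere) can be illustrated with a deliberately simplified sketch. It is not the article's model: the between-area random effect and the prediction interval are omitted, so the ratio reduces to the maximum-likelihood estimate of a plain Poisson model with the registry count as offset, which is the sum of proxy counts divided by the sum of registry counts. All counts and function names below are hypothetical.

```python
import math

def pooled_log_ratio(proxy_counts, registry_counts):
    """Calibration stage (simplified): log of the proxy/incidence ratio.

    Assumes proxy_i ~ Poisson(registry_i * exp(beta)) with NO random effect,
    in which case the MLE of exp(beta) is sum(proxy) / sum(registry).
    """
    return math.log(sum(proxy_counts) / sum(registry_counts))

def predict_incidence(proxy_count, log_ratio):
    """Prediction stage: invert the ratio to turn a proxy count into incidence."""
    return proxy_count / math.exp(log_ratio)

# Hypothetical areas WITH a registry: HC proxy counts and registry incidence.
proxy = [120, 95, 210, 160]
registry = [100, 80, 175, 130]

log_ratio = pooled_log_ratio(proxy, registry)

# Hypothetical area WITHOUT a registry, where only the proxy count is known.
predicted = predict_incidence(150, log_ratio)
print(f"estimated ratio: {math.exp(log_ratio):.3f}")
print(f"predicted incidence: {predicted:.1f}")
```

Ignoring heterogeneity in this way would understate the prediction error whenever the ratio genuinely varies between areas, which is the situation the mixed model in the article is designed to handle.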
