Modeling the probability distribution of positional errors incurred by residential address geocoding

Background
The assignment of a point-level geocode to subjects' residences is an important data assimilation component of many geographic public health studies. Often, these assignments are made by a method known as automated geocoding, which attempts to match each subject's address to an address-ranged street segment georeferenced within a streetline database and then interpolates the position of the address along that segment. Unfortunately, this process results in positional errors. Our study sought to model the probability distribution of positional errors associated with automated geocoding and E911 geocoding.

Results
Positional errors were determined for 1423 rural addresses in Carroll County, Iowa, as the vector difference between each 100%-matched automated geocode and its true location as determined by orthophoto and parcel information. Errors were also determined for 1449 60%-matched geocodes and 2354 E911 geocodes. Huge (> 15 km) outliers occurred among the 60%-matched geocoding errors; outliers also occurred for the other two types of geocoding errors but were much smaller. E911 geocoding was more accurate (median error length = 44 m) than 100%-matched automated geocoding (median error length = 168 m). The empirical distributions of positional errors associated with 100%-matched automated geocoding and E911 geocoding exhibited a distinctive Greek-cross shape and had many other features that could not be fitted adequately by a single bivariate normal or t distribution. However, mixtures of t distributions with two or three components fit the errors very well.

Conclusion
Mixtures of bivariate t distributions with few components appear to be flexible enough to fit many positional error datasets associated with geocoding, yet parsimonious enough to be feasible for nascent applications of measurement-error methodology to spatial epidemiology.
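As an illustration of the quantities named in the abstract, the following minimal sketch computes positional errors as vector differences, summarizes them by median error length, and evaluates the density of a mixture of bivariate t distributions. It is not the authors' code: the function names, the coordinate values, and the mixture parameters are hypothetical, and fitting the mixture (for example by an EM algorithm) is not shown. The density uses the standard bivariate t form, which for dimension two simplifies to (1 + d/nu)^(-(nu + 2)/2) / (2 pi sqrt(det Sigma)), where d is the squared Mahalanobis distance.

```python
import math
from statistics import median


def error_vectors(geocoded, truth):
    """Vector difference (dx, dy) between each geocode and its true location."""
    return [(gx - tx, gy - ty) for (gx, gy), (tx, ty) in zip(geocoded, truth)]


def median_error_length(errors):
    """Median Euclidean length of the error vectors."""
    return median(math.hypot(dx, dy) for dx, dy in errors)


def bivariate_t_pdf(x, mu, sigma, nu):
    """Bivariate t density with location mu, 2x2 scale matrix sigma, nu degrees of freedom."""
    (a, b), (c, d) = sigma
    det = a * d - b * c
    dx, dy = x[0] - mu[0], x[1] - mu[1]
    # Squared Mahalanobis distance, using the closed-form inverse of a 2x2 matrix.
    maha = (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det
    norm = 1.0 / (2.0 * math.pi * math.sqrt(det))
    return norm * (1.0 + maha / nu) ** (-(nu + 2.0) / 2.0)


def t_mixture_pdf(x, components):
    """Density of a finite mixture; components is a list of (weight, mu, sigma, nu)."""
    return sum(w * bivariate_t_pdf(x, mu, sigma, nu) for w, mu, sigma, nu in components)


# Hypothetical coordinates in metres (not data from the study).
geocoded = [(3.0, 4.0), (0.0, 0.0), (6.0, 8.0)]
truth = [(0.0, 0.0), (0.0, 0.0), (0.0, 0.0)]
errors = error_vectors(geocoded, truth)
med = median_error_length(errors)

# Hypothetical two-component mixture: a tight core and a wider component
# elongated east-west, chosen only to illustrate the interface.
identity = ((1.0, 0.0), (0.0, 1.0))
wide = ((400.0, 0.0), (0.0, 25.0))
components = [(0.7, (0.0, 0.0), identity, 3.0), (0.3, (0.0, 0.0), wide, 3.0)]
density_at_origin = t_mixture_pdf((0.0, 0.0), components)
```

A mixture of this kind can place heavy-tailed components along different axes, which is one plausible way a cross-shaped error distribution could be approximated with only a few components.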
