Decentralized and reproducible geocoding and characterization of community and environmental exposures for multisite studies

Abstract

Objective: Geocoding study participants' addresses and characterizing their geographic, community, and environmental exposures are frequent tasks in epidemiological studies. However, participant addresses are identifiable protected health information (PHI), and geocoding must be conducted in a Health Insurance Portability and Accountability Act–compliant manner. Our objective was to create a software application for this process that addresses limitations in current approaches.

Materials and Methods: We used a containerization platform to create DeGAUSS (Decentralized Geomarker Assessment for Multi-Site Studies), a software application that facilitates reproducible geocoding and geomarker assessment while maintaining the confidentiality of PHI. To validate the software, 215 350 addresses in Hamilton County, Ohio, were geocoded using DeGAUSS, ArcGIS, Google, and SAS, and the results were compared to a gold-standard approach. We distributed the DeGAUSS software to sites in an ongoing multisite study (Electronic Medical Records and Genomics, or eMERGE). Each site independently geocoded its participants' addresses, assigned median census tract–level income and distance to nearest major roadway, removed associated PHI, and returned deidentified data.

Results: Within a multisite study, 52 244 study participants' addresses across 5 sites were geocoded, with a median distance to roadway of 10 022 m and a median census tract income of $57 266, demonstrating the feasibility of DeGAUSS within a multisite study. Compared to other commonly used geocoding platforms, DeGAUSS had similar geocoding and geomarker assessment accuracies.

Conclusion: The open source DeGAUSS software overcomes multiple challenges in the use of address data in multisite studies and also serves as a more general reproducible research tool for geocoding and geomarker assessment.
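The site-level step described above (assign a geomarker, then strip PHI before returning data) can be illustrated with a minimal Python sketch. It is not the DeGAUSS implementation: the field names (`address`, `lat`, `lon`, `dist_to_road_m`), the helper functions, and the use of a spherical haversine distance to a list of road points are all assumptions made for illustration.

```python
import math

# Hypothetical names of the fields treated as PHI in this sketch.
PHI_FIELDS = ("address", "lat", "lon")


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres on a sphere (mean Earth radius)."""
    r = 6371008.8
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2)
    # Clamp to 1.0 to guard against floating-point overshoot.
    return 2 * r * math.asin(min(1.0, math.sqrt(a)))


def deidentify(record, road_points):
    """Attach distance to the nearest road point, then drop PHI fields."""
    dist = min(
        haversine_m(record["lat"], record["lon"], rlat, rlon)
        for rlat, rlon in road_points
    )
    out = {k: v for k, v in record.items() if k not in PHI_FIELDS}
    out["dist_to_road_m"] = round(dist)
    return out
```

Only the output of `deidentify` would leave the site in this sketch; the address and coordinates stay local.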
