Improving geocoding matching rates of structured addresses in Rio de Janeiro, Brazil.

Strategies for improving geocoded data often rely on interactive manual processes that are time-consuming and impractical for large-scale projects. In this study, we evaluated automated strategies for improving address quality and geocoding matching rates using a large dataset of addresses from death records in Rio de Janeiro, Brazil. The mortality data included 132,863 records with address information in a structured format. We applied regular expressions and dictionary-based methods for address standardization and enrichment. All records were linked by postal code or street name to the Brazilian National Address Directory (DNE), obtained from Brazil's Postal Service. Residential addresses were geocoded using Google Maps. A record was considered a geocoding match when its address was validated down to the street level and the returned location type was rooftop, range interpolated, or geometric center. Overall performance was assessed by manually reviewing a sample of addresses. Of the original 132,863 records, 85.7% (n = 113,876) were geocoded and validated, of which 83.8% were matched as rooftop (high accuracy). Overall sensitivity and specificity were 87% (95% CI: 86-88) and 98% (95% CI: 96-99), respectively. Our results indicate that address quality and geocoding completeness can be reliably improved with an automated geocoding process. R scripts and instructions to reproduce all the analyses are available at https://github.com/reprotc/geocoding.
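
The two automated steps described above, address standardization and the geocoding match rule, can be illustrated with a minimal sketch. It is written in Python for illustration only and is not the authors' implementation, which is provided as R scripts in the repository. The abbreviation dictionary and the function names standardize_street and is_geocoding_match are hypothetical; the location type values are those documented for the Google Geocoding API, and the street-level validation against the DNE is assumed to have been done elsewhere and passed in as a boolean.

```python
import re
import unicodedata

# Hypothetical abbreviation dictionary (assumption for illustration);
# the dictionary actually used in the study is not reproduced here.
ABBREVIATIONS = {
    "R": "RUA",
    "AV": "AVENIDA",
    "TV": "TRAVESSA",
    "EST": "ESTRADA",
    "PCA": "PRACA",
}

# Location types that the abstract counts as a geocoding match.
MATCH_LOCATION_TYPES = {"ROOFTOP", "RANGE_INTERPOLATED", "GEOMETRIC_CENTER"}


def standardize_street(raw: str) -> str:
    """Uppercase, strip accents and punctuation, expand a leading abbreviation."""
    # Decompose accented characters, then drop the combining marks.
    text = unicodedata.normalize("NFKD", raw.upper())
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^A-Z0-9 ]", " ", text)   # punctuation becomes a space
    text = re.sub(r"\s+", " ", text).strip()  # collapse repeated whitespace
    if not text:
        return text
    # Dictionary lookup on the first token only (the street type).
    first, _, rest = text.partition(" ")
    first = ABBREVIATIONS.get(first, first)
    return f"{first} {rest}".strip()


def is_geocoding_match(location_type: str, street_validated: bool) -> bool:
    """Match rule from the abstract: street-level validation plus an accepted location type."""
    return street_validated and location_type in MATCH_LOCATION_TYPES


if __name__ == "__main__":
    print(standardize_street("Av. Atlântica"))
    print(is_geocoding_match("APPROXIMATE", True))
```

Expanding only the leading token keeps the sketch conservative: abbreviations such as "R" are ambiguous elsewhere in a street name, so a fuller pipeline would need additional context-aware rules.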
