Geocoding Billions of Addresses: Toward a Spatial Record Linkage System with Big Data

Addresses are among the most commonly used spatial data in everyday life. Comparing two addresses (e.g., determining whether they refer to the same location) is a fundamental problem in address-related record linkage. In this paper, we develop a fast, reliable, and expandable address parser/standardizer/geocoder as an initial step toward spatial record linkage. First, we created a CASS-based geocoding test set and evaluated the performance of online geocoding API providers (Google, Yahoo, Bing). Given their high time consumption and flaws in geocoding precision, we developed an in-house TIGER/Line-based hierarchical geocoder, the Intelius Address Parser (IAP), which provides geocoding precision on par with the online geocoding APIs. For over one billion addresses processed on a 25-node Hadoop cluster set up on Amazon AWS, we report the time consumption and cost and compare them with commercial solutions. Strategies for using geocoded addresses for record linkage are presented, and plans for expanding the use of geocoded results are discussed.
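To make the address-comparison problem concrete, the following minimal sketch shows one way two geocoded addresses might be compared once each has been resolved to a latitude/longitude pair. It is an illustration only and is not the IAP implementation: the function names `haversine_m` and `likely_same_location`, the 50-meter threshold, and the sample coordinates are all assumptions introduced here.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius in meters (spherical approximation)


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    # Clamp to 1.0 to guard against floating-point overshoot before asin.
    return 2 * EARTH_RADIUS_M * math.asin(min(1.0, math.sqrt(a)))


def likely_same_location(point_a, point_b, threshold_m=50.0):
    """Hypothetical linkage rule: treat two geocodes as the same place
    when they lie within threshold_m meters of each other."""
    return haversine_m(*point_a, *point_b) <= threshold_m


# Illustrative coordinates (assumed, not taken from the paper's data set).
a = (47.6100, -122.2000)
b = (47.6101, -122.2000)  # roughly 11 m north of a
c = (47.7100, -122.2000)  # roughly 11 km north of a
print(likely_same_location(a, b), likely_same_location(a, c))
```

A distance threshold like this is only a starting point: a real linkage rule would also need to account for the precision level at which each address was geocoded (for example, rooftop versus street segment versus ZIP centroid).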