Optimizing distance-based methods for large data sets

Abstract

Distance-based methods for measuring the spatial concentration of industries have become increasingly popular in the spatial econometrics community. However, their computational complexity limits their use, since both their memory requirements and their running times are in $\mathcal{O}(n^2)$. In this paper, we present an algorithm with constant memory requirements and a shorter running time, enabling distance-based methods to deal with large data sets. We discuss three recent distance-based methods in spatial econometrics: the D&O-Index by Duranton and Overman (Rev Econ Stud 72(4):1077–1106, 2005), the M-function by Marcon and Puech (J Econ Geogr 10(5):745–762, 2010) and the Cluster-Index by Scholl and Brenner (Reg Stud (ahead-of-print):1–15, 2014). Finally, we present an alternative calculation for the latter index that allows the use of data sets with millions of firms.
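To illustrate why memory need not grow with the number of pairs, the following is a minimal sketch and not the algorithm presented in the paper. It assumes that a distance-based measure only needs counts of pairwise distances per distance bin, so each distance can be added to a fixed-size histogram and then discarded instead of being stored in an $n \times n$ matrix. The function name `pairwise_distance_histogram` and the parameters `bin_width` and `n_bins` are hypothetical. The running time of this sketch is still quadratic; only the extra memory is independent of $n$.

```python
import math


def pairwise_distance_histogram(points, bin_width, n_bins):
    """Count pairwise distances per bin without storing a distance matrix.

    Hypothetical helper for illustration only: extra memory is n_bins
    counters, regardless of how many points are passed in.
    """
    counts = [0] * n_bins
    for i in range(len(points)):
        xi, yi = points[i]
        # Visit each unordered pair (i, j) exactly once.
        for j in range(i + 1, len(points)):
            xj, yj = points[j]
            d = math.hypot(xi - xj, yi - yj)
            k = int(d // bin_width)
            # Distances beyond the last bin are ignored.
            if k < n_bins:
                counts[k] += 1
    return counts


# Three made-up firm locations on a line through the origin.
firms = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
hist = pairwise_distance_histogram(firms, bin_width=5.0, n_bins=3)
print(hist)
```

The histogram could then feed whatever density or counting step a particular index requires; that step is left out here because it depends on the index being computed.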
