Given an $N \times N$ grid of squares, where each square has a count $c_{ij}$ and an underlying population $p_{ij}$, our goal is to find the rectangular region with the highest density, and to calculate its significance by randomization. An arbitrary density function $D$, dependent on a region's total count $C$ and total population $P$, can be used. For example, if each count represents the number of disease cases occurring in that square, we can use Kulldorff's spatial scan statistic $D_K$ to find the most significant spatial disease cluster. A naive approach to finding the maximum density region requires $O(N^4)$ time, and is generally computationally infeasible. We present a multiresolution algorithm which partitions the grid into overlapping regions using a novel overlap-kd tree data structure, bounds the maximum score of subregions contained in each region, and prunes regions which cannot contain the maximum density region. For sufficiently dense regions, this method finds the maximum density region in $O((N \log N)^2)$ time, in practice resulting in significant (20-2000x) speedups on both real and simulated datasets.
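To make the naive baseline concrete, the following is a minimal sketch of the exhaustive $O(N^4)$ search and of significance testing by randomization. It is not the multiresolution algorithm described above. Several pieces are assumptions made for illustration: the function names are hypothetical, the density function is the commonly used log-likelihood-ratio form of Kulldorff's statistic under a Poisson model, and replica grids are generated by redistributing the total count over squares in proportion to population.

```python
import math
import random

def kulldorff_score(C, P, C_tot, P_tot):
    """Assumed log-likelihood-ratio form of D_K; 0 unless the rate inside exceeds the rate outside."""
    if P <= 0 or P >= P_tot:
        return 0.0
    if C / P <= (C_tot - C) / (P_tot - P):
        return 0.0
    def term(c, p):
        # c * log(c / p), with the convention 0 * log(0) = 0
        return c * math.log(c / p) if c > 0 else 0.0
    return term(C, P) + term(C_tot - C, P_tot - P) - term(C_tot, P_tot)

def prefix_sums(grid):
    """2D cumulative sums so that any rectangle total costs O(1)."""
    n = len(grid)
    S = [[0] * (n + 1) for _ in range(n + 1)]
    for i in range(n):
        for j in range(n):
            S[i + 1][j + 1] = grid[i][j] + S[i][j + 1] + S[i + 1][j] - S[i][j]
    return S

def rect_sum(S, r0, c0, r1, c1):
    """Total over rows r0..r1 and columns c0..c1 (inclusive)."""
    return S[r1 + 1][c1 + 1] - S[r0][c1 + 1] - S[r1 + 1][c0] + S[r0][c0]

def naive_max_region(counts, pops, score=kulldorff_score):
    """Score every axis-aligned rectangle: O(N^4) rectangles, O(1) each."""
    n = len(counts)
    SC, SP = prefix_sums(counts), prefix_sums(pops)
    C_tot, P_tot = SC[n][n], SP[n][n]
    best_score, best_rect = 0.0, None
    for r0 in range(n):
        for r1 in range(r0, n):
            for c0 in range(n):
                for c1 in range(c0, n):
                    C = rect_sum(SC, r0, c0, r1, c1)
                    P = rect_sum(SP, r0, c0, r1, c1)
                    d = score(C, P, C_tot, P_tot)
                    if d > best_score:
                        best_score, best_rect = d, (r0, c0, r1, c1)
    return best_score, best_rect

def randomization_p_value(counts, pops, replicas=19, seed=0):
    """Monte Carlo p-value: rank the observed maximum among replica maxima.

    Assumption: under the null, each case falls in a square with probability
    proportional to that square's population.
    """
    rng = random.Random(seed)
    n = len(counts)
    observed, _ = naive_max_region(counts, pops)
    cells = [(i, j) for i in range(n) for j in range(n)]
    weights = [pops[i][j] for i, j in cells]
    total_cases = sum(sum(row) for row in counts)
    beat = 0
    for _ in range(replicas):
        replica = [[0] * n for _ in range(n)]
        for i, j in rng.choices(cells, weights=weights, k=total_cases):
            replica[i][j] += 1
        if naive_max_region(replica, pops)[0] >= observed:
            beat += 1
    return (beat + 1) / (replicas + 1)

# Toy 4 x 4 grid: uniform population, elevated counts in the central 2 x 2 block.
pops = [[10] * 4 for _ in range(4)]
counts = [[1, 1, 1, 1],
          [1, 8, 8, 1],
          [1, 8, 8, 1],
          [1, 1, 1, 1]]
best_score, best_rect = naive_max_region(counts, pops)
p_value = randomization_p_value(counts, pops)
```

On this toy grid the search returns the central block as `(1, 1, 2, 2)` in `(r0, c0, r1, c1)` form. Because every replica repeats the full $O(N^4)$ search, the cost of randomization multiplies the cost of a single search, which is why a faster search matters at realistic grid sizes.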