Given a collection of geo-distributed points, we aim to detect statistically significant clusters of varying shapes and densities. Spatial clustering has been widely used many important societal applications, including public health and safety, transportation, environment, etc. The problem is challenging because many application domains have low-tolerance to false positives (e.g., falsely claiming a crime cluster in a community can have serious negative impacts on the residents) and clusters often have irregular shapes. In related work, the spatial scan statistic is a popular technique that can detect significant clusters but it requires clusters to have certain predefined shapes (e.g., circles, rings). In contrast, density-based methods (e.g., DBSCAN) can find clusters of arbitrary shape efficiently but do not consider statistical significance, making them susceptible to spurious patterns. To address these limitations, we first propose a modeling of statistical significance in DBSCAN based clustering. Then, we propose a baseline Monte Carlo method to estimate the significance of clusters and a Dual-Convergence algorithm to accelerate the computation. Experiment results show that significant DBSCAN is very effective in removing chance patterns and the Dual-Convergence algorithm can greatly reduce execution time.
[1]
Andrew W. Moore,et al.
Rapid detection of significant spatial clusters
,
2004,
KDD.
[2]
Shashi Shekhar,et al.
A Nondeterministic Normalization based Scan Statistic (NN-scan) towards Robust Hotspot Detection: A Summary of Results
,
2019,
SDM.
[3]
Shashi Shekhar,et al.
Ring-Shaped Hotspot Detection: A Summary of Results
,
2014,
2014 IEEE International Conference on Data Mining.
[4]
M. Kulldorff.
A spatial scan statistic
,
1997
.
[5]
Shai Ben-David,et al.
Measures of Clustering Quality: A Working Set of Axioms for Clustering
,
2008,
NIPS.
[6]
Shashi Shekhar,et al.
Transdisciplinary Foundations of Geospatial Data Science
,
2017,
ISPRS Int. J. Geo Inf..
[7]
Hans-Peter Kriegel,et al.
OPTICS: ordering points to identify the clustering structure
,
1999,
SIGMOD '99.
[8]
Hans-Peter Kriegel,et al.
A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise
,
1996,
KDD.
[9]
Arthur Zimek,et al.
Hierarchical Density Estimates for Data Clustering, Visualization, and Outlier Detection
,
2015,
ACM Trans. Knowl. Discov. Data.
[10]
Andrew W. Moore,et al.
Detection of spatial and spatio-temporal clusters
,
2006
.