The huge amount of information stored in corporate databases (e.g., retail, financial, telecom) has spurred tremendous interest in knowledge discovery and data mining. Clustering is a useful data mining technique for discovering interesting distributions and patterns in the underlying data, with applications in statistical data analysis, pattern recognition, image processing, and various business domains. Although researchers have worked on clustering for decades and have developed many algorithms, there is still no efficient algorithm for clustering very large databases and high-dimensional data. As an outstanding representative of clustering algorithms, the DBSCAN algorithm performs well in spatial data clustering. However, for large spatial databases, DBSCAN requires a large amount of memory and can incur substantial I/O costs because it operates directly on the entire database. In this paper, several approaches are proposed to scale the DBSCAN algorithm to large spatial databases. First, a fast DBSCAN algorithm is developed, which considerably speeds up the original DBSCAN algorithm. Then a sampling-based DBSCAN algorithm, a partitioning-based DBSCAN algorithm, and a parallel DBSCAN algorithm are introduced in turn. Next, a synthetic algorithm based on the algorithms proposed above is given. Finally, experimental results are presented to demonstrate the effectiveness and efficiency of these algorithms.
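For orientation, the baseline behaviour that motivates this work can be sketched as follows. This is a minimal in-memory illustration of the original DBSCAN procedure, not any of the scaled variants proposed in this paper. The names `dbscan`, `region_query`, `eps`, and `min_pts` are illustrative assumptions, and the linear scan in `region_query` stands in for the spatial index that a database implementation would normally use.

```python
from math import dist  # Euclidean distance, Python 3.8+

NOISE = -1  # label for points that belong to no cluster


def region_query(points, i, eps):
    """Return indices of all points within eps of points[i].

    Assumption: a plain linear scan over the whole dataset, which is
    why every query touches the entire database.
    """
    return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or NOISE."""
    labels = [None] * len(points)  # None means "not yet visited"
    cluster_id = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may later be relabeled as a border point
            continue
        # points[i] is a core point: start a new cluster and expand it.
        labels[i] = cluster_id
        seeds = list(neighbors)
        k = 0
        while k < len(seeds):
            j = seeds[k]
            k += 1
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border point reached from a core point
            if labels[j] is not None:
                continue  # already assigned to a cluster
            labels[j] = cluster_id
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                seeds.extend(j_neighbors)  # j is also a core point
        cluster_id += 1
    return labels
```

Because each point triggers one region query over the full dataset, the sketch makes the memory and I/O pressure on large spatial databases easy to see: the whole point set must be reachable for every query.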