Rough-DBSCAN: A fast hybrid density based clustering method for large data sets

Density-based clustering techniques like DBSCAN are attractive because they can find arbitrarily shaped clusters along with noisy outliers. However, DBSCAN's time requirement is O(n^2), where n is the size of the dataset, which makes it unsuitable for large datasets. The solution proposed in this paper is to first apply the leaders clustering method to derive prototypes, called leaders, from the dataset in a way that also preserves density information, and then to use these leaders to derive the density-based clusters. The proposed hybrid clustering technique, called rough-DBSCAN, has a time complexity of only O(n) and is analyzed using rough set theory. Experimental studies on both synthetic and real-world datasets compare rough-DBSCAN with DBSCAN. They show that, for large datasets, rough-DBSCAN finds a clustering similar to the one found by DBSCAN while being consistently faster. Some properties of the leaders as prototypes are also formally established.
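The first stage described above can be illustrated with a minimal Python sketch of single-pass leaders clustering that keeps a follower count per leader as its density information. The abstract does not give the details, so the distance threshold `tau`, the function names, and the `approx_density` helper (which estimates the number of points near a leader by summing the counts of nearby leaders) are assumptions made for illustration, not the paper's exact procedure.

```python
from math import dist


def leaders_clustering(points, tau):
    """Single pass over the data: each point joins the first leader
    within distance tau, otherwise it becomes a new leader."""
    leaders = []      # prototype points
    counts = []       # number of followers per leader (including itself)
    assignment = []   # index of the leader each point was assigned to
    for p in points:
        for i, leader in enumerate(leaders):
            if dist(p, leader) <= tau:
                counts[i] += 1
                assignment.append(i)
                break
        else:
            # No existing leader is close enough, so p starts a new group.
            leaders.append(p)
            counts.append(1)
            assignment.append(len(leaders) - 1)
    return leaders, counts, assignment


def approx_density(leaders, counts, i, eps):
    """Hypothetical helper: estimate how many original points lie near
    leader i by summing the counts of all leaders within eps of it."""
    return sum(c for l, c in zip(leaders, counts) if dist(leaders[i], l) <= eps)


points = [(0, 0), (0.1, 0), (5, 5), (5.1, 5), (0, 0.1)]
leaders, counts, assignment = leaders_clustering(points, tau=0.5)
```

Each point is compared only against the current leaders, so the pass costs on the order of n times the number of leaders rather than n^2. A density-based step can then operate on the much smaller set of leaders and their counts instead of on every point.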
