A fully distributed clustering algorithm based on fractal dimension

Clustering, the grouping of similar objects, is one of the most widely used procedures in data mining. It has received enormous attention, and many methods have been proposed in recent decades. However, traditional clustering algorithms require all data objects to be located at a single site, where they are analyzed. This limitation is increasingly hard to accept, because today's very large data sets are often stored on independently working computers connected via local or wide area networks rather than at a single site. In this paper, we therefore propose a fully distributed clustering algorithm, called fully distributed clustering based on fractal dimension (FDCFD), which enables the sites to collaborate in forming a global clustering model at low communication cost. The main idea behind FDCFD is to use the fractal dimension to group points into clusters in such a way that no point in a cluster changes the cluster's fractal dimension radically. Our theoretical analysis shows that the approach works well for clustering data that is inherently distributed: it collects information spread over several local sites to form a global clustering while avoiding the communication costs and delays of transmitting the data itself.
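The grouping criterion can be illustrated with a minimal single-site sketch. It is not the FDCFD algorithm itself: the abstract does not say how the fractal dimension is estimated, so the sketch assumes a box-counting estimate, and the function names, the grid side lengths, and the threshold value are all hypothetical. The exchange of information between sites is not covered.

```python
import math

def box_count(points, side):
    """Count the grid cells of the given side length that hold at least one point."""
    return len({tuple(math.floor(c / side) for c in p) for p in points})

def box_counting_dimension(points, sides=(0.5, 0.25, 0.125, 0.0625)):
    """Estimate the box-counting dimension as the least-squares slope of
    log N(r) against log(1/r), where N(r) is the occupied-cell count.
    The side lengths are an assumption chosen for data in the unit square."""
    xs = [math.log(1.0 / r) for r in sides]
    ys = [math.log(box_count(points, r)) for r in sides]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx

def assign(point, clusters, threshold=0.05):
    """Add point to the cluster whose estimated dimension it changes least.
    Returns the cluster index, or None (point left unassigned) when even the
    smallest change exceeds the hypothetical threshold."""
    best_index, best_delta = None, None
    for i, cluster in enumerate(clusters):
        before = box_counting_dimension(cluster)
        after = box_counting_dimension(cluster + [point])
        delta = abs(after - before)
        if best_delta is None or delta < best_delta:
            best_index, best_delta = i, delta
    if best_index is not None and best_delta <= threshold:
        clusters[best_index].append(point)
        return best_index
    return None
```

For example, with one cluster sampled along a line segment and another filling the unit square, a point inside the square leaves the square's estimate unchanged and is assigned there, while a point far from both clusters shifts both estimates by more than the threshold and is left unassigned.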
