Combining Parallel Self-Organizing Maps and K-Means to Cluster Distributed Data

Clustering is the process of discovering groups within multidimensional data based on similarities, with minimal prior knowledge of the data's structure. In previous work, we presented partSOM, an algorithm based on self-organizing maps (SOM) for clustering distributed datasets. This work extends that approach with a strategy for efficient cluster analysis in distributed databases using SOM and K-means. The proposed strategy applies the SOM algorithm separately to each distributed dataset, each corresponding to a vertical partition of the database, to obtain a representative subset of each local dataset. These representative subsets are then sent to a central site, which fuses the partial results and applies the SOM and K-means algorithms to obtain the final result. Experimental results are compared with the traditional SOM and partSOM approaches on different datasets.
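
The pipeline described above can be illustrated with a minimal sketch in Python. The abstract does not specify the details, so the following are assumptions made for illustration only: each site sends its SOM codebook together with the best-matching-unit index of every object, the central site fuses them by concatenating the matched codebook vectors column-wise, and K-means is run on the units of the central SOM. All function names, map sizes, and training parameters are hypothetical and are not taken from the paper.

```python
import numpy as np

def train_som(data, rows, cols, epochs=20, lr0=0.5, seed=0):
    """Train a small rectangular SOM; returns a (rows*cols, dim) codebook."""
    rng = np.random.default_rng(seed)
    grid = np.array([(r, c) for r in range(rows) for c in range(cols)], dtype=float)
    # Initialise units from randomly chosen samples.
    weights = data[rng.choice(len(data), rows * cols, replace=True)].astype(float)
    sigma0 = max(rows, cols) / 2.0
    total, step = epochs * len(data), 0
    for _ in range(epochs):
        for x in data[rng.permutation(len(data))]:
            frac = step / total
            lr = lr0 * (1.0 - frac)                 # linearly decaying learning rate
            sigma = sigma0 * (1.0 - frac) + 0.5     # shrinking neighbourhood radius
            bmu = np.argmin(((weights - x) ** 2).sum(axis=1))
            grid_dist2 = ((grid - grid[bmu]) ** 2).sum(axis=1)
            h = np.exp(-grid_dist2 / (2.0 * sigma ** 2))
            weights += lr * h[:, None] * (x - weights)
            step += 1
    return weights

def bmu_indices(data, weights):
    """Index of the best-matching unit for every row of data."""
    d = ((data[:, None, :] - weights[None, :, :]) ** 2).sum(axis=2)
    return d.argmin(axis=1)

def kmeans(data, k, iters=50, seed=0):
    """Plain Lloyd's K-means; returns (labels, centers)."""
    rng = np.random.default_rng(seed)
    centers = data[rng.choice(len(data), k, replace=False)].astype(float)
    for _ in range(iters):
        labels = bmu_indices(data, centers)
        for j in range(k):
            if np.any(labels == j):                 # leave empty clusters unchanged
                centers[j] = data[labels == j].mean(axis=0)
    return bmu_indices(data, centers), centers

def local_site(partition, shape=(4, 4), seed=0):
    """Runs at each site: train a SOM on the local columns only."""
    codebook = train_som(partition, *shape, seed=seed)
    return codebook, bmu_indices(partition, codebook)

def central_site(local_results, k, shape=(4, 4)):
    """Fuse the local results, then apply SOM followed by K-means."""
    # Assumed fusion: replace each object by its matched codebook vectors,
    # concatenated across the vertical partitions.
    fused = np.hstack([codebook[idx] for codebook, idx in local_results])
    central = train_som(fused, *shape, seed=100)
    unit_labels, _ = kmeans(central, k)             # cluster the map units
    return unit_labels[bmu_indices(fused, central)] # label objects via their unit
```

With `partitions` a list of arrays that share the same rows but hold different columns, `central_site([local_site(p, seed=i) for i, p in enumerate(partitions)], k)` returns one cluster label per object. Only codebooks and index vectors leave the local sites in this sketch, never the raw attribute values.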
