Parallel Algorithms for Clustering High-Dimensional Large-Scale Datasets

Clustering techniques for large-scale, high-dimensional data sets have attracted great interest in the recent literature. Such data sets arise in both scientific and commercial applications. Clustering is the process of identifying dense regions in a sparse multi-dimensional data set. Several earlier clustering techniques lack scalability either to a very large number of dimensions or to a large data set. Many of them require key user inputs, which makes them hard to apply to real-world data sets, or fail to represent the generated clusters in an intuitive way. We have designed and implemented pMAFIA, a density- and grid-based clustering algorithm in which a multi-dimensional space is divided into finer grids and the dense regions found are merged to identify the clusters. For large data sets with a large number of dimensions, fine division of the multi-dimensional space leads to an enormous amount of computation. We have introduced an adaptive grid framework that not only vastly reduces the computation by forming grids based on the data distribution, but also improves the quality of clustering. Clustering algorithms also need to explore clusters in subspaces of the total data space. We have implemented a new bottom-up algorithm that explores all possible subspaces to identify the embedded clusters. Further, our framework requires no user input, making pMAFIA a completely unsupervised data mining algorithm. Finally, we have introduced parallelism into the clustering process, which enables our data mining tool to scale to massive data sets and large numbers of dimensions. Data parallelism coupled with task parallelism has been shown to yield the best parallelization results on a diverse set of synthetic and real data sets.
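
The core idea of grid- and density-based clustering described above (divide the space into cells, keep the dense cells, merge adjacent dense cells into clusters) can be illustrated with a minimal sketch. This is not the pMAFIA implementation: it assumes a fixed uniform grid rather than an adaptive one, works in the full space rather than in subspaces, runs serially, and takes two hypothetical parameters, `cell_size` and `min_points`, that pMAFIA itself is stated not to require from the user.

```python
from collections import defaultdict, deque


def grid_clusters(points, cell_size, min_points):
    """Cluster points by merging face-adjacent dense cells of a uniform grid.

    Illustrative only: `cell_size` and `min_points` are assumed parameters,
    not part of the algorithm described in the abstract.
    Returns a list of clusters, each a sorted list of point indices.
    """
    # Assign every point to the grid cell that contains it.
    cells = defaultdict(list)
    for idx, point in enumerate(points):
        key = tuple(int(x // cell_size) for x in point)
        cells[key].append(idx)

    # A cell is dense if it holds at least `min_points` points.
    dense = {key for key, members in cells.items() if len(members) >= min_points}
    if not dense:
        return []

    # Neighbor offsets: cells that differ by one step in exactly one dimension.
    dims = len(next(iter(dense)))
    offsets = []
    for d in range(dims):
        for step in (-1, 1):
            offset = [0] * dims
            offset[d] = step
            offsets.append(tuple(offset))

    # Merge adjacent dense cells with a breadth-first search.
    clusters = []
    seen = set()
    for start in sorted(dense):
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        members = []
        while queue:
            cell = queue.popleft()
            members.extend(cells[cell])
            for offset in offsets:
                neighbor = tuple(c + o for c, o in zip(cell, offset))
                if neighbor in dense and neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        clusters.append(sorted(members))
    return clusters
```

With ten two-dimensional points, six of them spread over two neighboring unit cells, three in a distant cell, and one isolated, calling `grid_clusters(points, 1.0, 3)` yields two clusters and leaves the isolated point unassigned as noise. A uniform grid like this one grows exponentially with the number of dimensions, which is the cost the adaptive grid framework is said to reduce.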
