A Density-based Preprocessing Technique to Scale Out Clustering

Clustering big data is a challenging task, because most high-quality clustering algorithms do not scale well with data set cardinality. To tackle this scalability problem, we propose a general-purpose density-based preprocessing technique, called SCOUT, implemented in the Spark framework. It compacts the original data into a small set of representative points while preserving the original data distribution and density information. This small set of representative points can then serve as input to almost any clustering algorithm, so even complex, high-quality in-memory algorithms can be applied. A thorough experimental evaluation shows that the proposed approach is both efficient and effective.
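To illustrate the general idea of density-based summarization into representative points, the following is a minimal sketch; SCOUT's actual algorithm is not described in this abstract, so the grid-based density estimate, the function name `summarize`, and its parameters are all assumptions chosen purely for illustration.

```python
# Hypothetical sketch of density-based summarization: map points to grid
# cells, keep only sufficiently dense cells, and replace each kept cell by
# its centroid weighted by the number of points it contains. This is NOT
# SCOUT itself, only an illustration of the preprocessing concept.
from collections import defaultdict

def summarize(points, cell_size=1.0, min_density=2):
    """Return weighted representative points (cx, cy, weight) for every
    grid cell containing at least `min_density` input points."""
    cells = defaultdict(list)
    for x, y in points:
        # Assign each 2-D point to a grid cell of side `cell_size`.
        cells[(int(x // cell_size), int(y // cell_size))].append((x, y))
    reps = []
    for members in cells.values():
        if len(members) >= min_density:  # keep only dense regions
            cx = sum(p[0] for p in members) / len(members)
            cy = sum(p[1] for p in members) / len(members)
            reps.append((cx, cy, len(members)))  # centroid plus weight
    return reps
```

The resulting weighted representatives could then be fed to a standard (possibly weight-aware) clustering algorithm in place of the full data set.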