A Density-based Preprocessing Technique to Scale Out Clustering

Clustering big data is a challenging task, because most high-quality clustering algorithms do not scale well with data set cardinality. To tackle this scalability problem, we propose a general-purpose density-based preprocessing technique, called SCOUT, implemented in the Spark framework. It compacts the original data into a small set of representative points while preserving the original data distribution and density information. This small set of representative points can then serve as input to almost any clustering algorithm, so even complex, high-quality in-memory algorithms can be applied. A thorough experimental evaluation shows that the proposed approach is both efficient and effective.
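To illustrate the general idea of density-based summarization into representative points, the following is a minimal sketch; SCOUT's actual algorithm is not described in this abstract, so the grid-based density estimate, the function name `summarize`, and its parameters are all assumptions chosen purely for illustration.

```python
# Hypothetical sketch of density-based summarization: map points to grid
# cells, keep only sufficiently dense cells, and replace each kept cell by
# its centroid weighted by the number of points it contains. This is NOT
# SCOUT itself, only an illustration of the preprocessing concept.
from collections import defaultdict

def summarize(points, cell_size=1.0, min_density=2):
    """Return weighted representative points (cx, cy, weight) for every
    grid cell containing at least `min_density` input points."""
    cells = defaultdict(list)
    for x, y in points:
        # Assign each 2-D point to a grid cell of side `cell_size`.
        cells[(int(x // cell_size), int(y // cell_size))].append((x, y))
    reps = []
    for members in cells.values():
        if len(members) >= min_density:  # keep only dense regions
            cx = sum(p[0] for p in members) / len(members)
            cy = sum(p[1] for p in members) / len(members)
            reps.append((cx, cy, len(members)))  # centroid plus weight
    return reps
```

The resulting weighted representatives could then be fed to a standard (possibly weight-aware) clustering algorithm in place of the full data set.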