Clustering Approach for Data Lake Based on Medoid’s Ranking Strategy

Many conventional clustering algorithms scale poorly, especially on data lakes. Modified clustering algorithms have therefore been proposed that speed up the conventional ones by means of data sampling techniques. However, these approaches require the number of clusters to be known before centroids can be selected for the final clustering. To address this limitation, this paper develops a two-phase clustering methodology. In the first phase, rather than drawing a random sample, we define a novel approach that computes plausible sample points and uses them as centroids for the final clusters. In the second phase, to speed up our clustering algorithm, we propose a parallelization scheme implemented with Spark parallel processing. Computational experiments show that the Global sampling method outperforms the widely used K-means algorithm in both clustering quality and stability under the same parameter settings.
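The two-phase idea can be illustrated with a minimal sketch. The abstract does not specify the ranking criterion or the parallelization details, so everything below is an assumption made for illustration: the function names, the `min_sep` separation threshold, and the use of total pairwise distance as the medoid ranking score are hypothetical, and plain Python lists stand in for Spark partitions.

```python
from math import dist


def select_centroids(points, min_sep):
    """Phase 1 (assumed): rank points by medoid cost and keep separated ones.

    The medoid cost of a point is its total distance to all other points.
    A candidate is accepted only if it lies at least `min_sep` away from
    every centroid already chosen, so the number of clusters is not given
    in advance. `min_sep` is a hypothetical parameter, not from the paper.
    """
    ranked = sorted(points, key=lambda p: sum(dist(p, q) for q in points))
    centroids = []
    for p in ranked:
        if all(dist(p, c) >= min_sep for c in centroids):
            centroids.append(p)
    return centroids


def assign_partition(partition, centroids):
    """Phase 2 worker: label each point with the index of its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
        for p in partition
    ]


def cluster(points, min_sep, n_partitions=2):
    """Run both phases; partitions are processed one by one here.

    In a Spark setting each partition could be handled by a separate
    worker, since the assignment step only needs the shared centroids.
    """
    centroids = select_centroids(points, min_sep)
    size = -(-len(points) // n_partitions)  # ceiling division
    partitions = [points[i:i + size] for i in range(0, len(points), size)]
    labels = []
    for part in partitions:
        labels.extend(assign_partition(part, centroids))
    return centroids, labels


if __name__ == "__main__":
    data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    found_centroids, found_labels = cluster(data, min_sep=5.0)
    print(found_centroids, found_labels)
```

The quadratic ranking step is only practical on a small sample; the assignment step is independent per partition, which is what makes it a natural candidate for parallel execution.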
