Similarity Grouping in Big Data Systems

Distributed computing technologies have opened the door for a wide range of organizations to analyze massive amounts of data. Grouping (fast, but based on exact-match semantics) and clustering (relatively slow, but based on similarity-aware semantics) are among the most useful data analysis operations. Previous work introduced the Similarity Grouping (SG) operators, which aim to integrate the best features of grouping and clustering, i.e., fast execution times and similarity-aware grouping semantics. The SG operators, however, were proposed for single-node relational database systems. This paper introduces the Distributed Similarity Grouping (DSG) operator, a highly parallel operator for identifying similarity groups in big datasets. DSG enables the identification of groups in which all elements are within a given distance threshold of each other. This paper presents DSG’s design details, implementation guidelines for Spark and Hadoop (two important Big Data systems), and an extensive performance and scalability evaluation.
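To make the grouping semantics concrete, the following is a minimal single-node sketch of the "all elements within a given threshold of each other" condition. It is not the DSG algorithm described in the paper, which is distributed; it is a naive greedy illustration. The function names, the use of Euclidean distance, and the parameter name `eps` are assumptions made for this example.

```python
from math import dist
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def similarity_groups(points: Sequence[Point], eps: float) -> List[List[Point]]:
    """Greedy illustration of threshold-based similarity grouping.

    Each point joins the first existing group in which it is within
    eps of EVERY member; otherwise it starts a new group.
    (Hypothetical helper, not the paper's distributed operator.)
    """
    groups: List[List[Point]] = []
    for p in points:
        for g in groups:
            if all(dist(p, q) <= eps for q in g):
                g.append(p)
                break
        else:
            # No existing group accepts p, so it forms its own group.
            groups.append([p])
    return groups


def is_valid_group(group: Sequence[Point], eps: float) -> bool:
    """Check that all pairs in a group are within eps of each other."""
    return all(
        dist(a, b) <= eps
        for i, a in enumerate(group)
        for b in group[i + 1:]
    )


points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.5, 5.0), (2.0, 0.0)]
groups = similarity_groups(points, eps=1.5)
for g in groups:
    print(g)
```

In this example the point (2.0, 0.0) is within the threshold of (1.0, 0.0) but not of (0.0, 0.0), so it is not added to the first group. This is the difference between the all-pairs condition and chain-style clustering, where such a point would be absorbed. The greedy result depends on input order, which is one reason a real operator needs a more careful design.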
