A Survey of Clustering Algorithms for Big Data: Taxonomy and Empirical Analysis

Clustering algorithms have emerged as an alternative powerful meta-learning tool to accurately analyze the massive volume of data generated by modern applications. In particular, their main goal is to categorize data into clusters such that objects are grouped in the same cluster when they are similar according to specific metrics. There is a vast body of knowledge in the area of clustering and there has been attempts to analyze and categorize them for a larger number of applications. However, one of the major issues in using clustering algorithms for big data that causes confusion amongst practitioners is the lack of consensus in the definition of their properties as well as a lack of formal categorization. With the intention of alleviating these problems, this paper introduces concepts and algorithms related to clustering, a concise survey of existing (clustering) algorithms as well as providing a comparison, both from a theoretical and an empirical perspective. From a theoretical perspective, we developed a categorizing framework based on the main properties pointed out in previous studies. Empirically, we conducted extensive experiments where we compared the most representative algorithm from each of the categories using a large number of real (big) data sets. The effectiveness of the candidate clustering algorithms is measured through a number of internal and external validity metrics, stability, runtime, and scalability tests. In addition, we highlighted the set of clustering algorithms that are the best performing for big data.

[1]  Jiawei Han,et al.  Efficient and Effective Clustering Methods for Spatial Data Mining , 1994, VLDB.

[2]  Charu C. Aggarwal,et al.  A Survey of Text Clustering Algorithms , 2012, Mining Text Data.

[3]  Zahir Tari,et al.  PPFSCADA: Privacy preserving framework for SCADA data publishing , 2014, Future Gener. Comput. Syst..

[4]  Mo Hai 大数据聚类算法综述 (Survey of Clustering Algorithms for Big Data) , 2016, 计算机科学.

[5]  J. Bezdek,et al.  FCM: The fuzzy c-means clustering algorithm , 1984 .

[6]  D. Rubin,et al.  Maximum likelihood from incomplete data via the EM - algorithm plus discussions on the paper , 1977 .

[7]  Tian Zhang,et al.  BIRCH: an efficient data clustering method for very large databases , 1996, SIGMOD '96.

[8]  Zahir Tari,et al.  A Framework for Improving the Accuracy of Unsupervised Intrusion Detection for SCADA Systems , 2013, 2013 12th IEEE International Conference on Trust, Security and Privacy in Computing and Communications.

[9]  Vipin Kumar,et al.  Chameleon: Hierarchical Clustering Using Dynamic Modeling , 1999, Computer.

[10]  Petra Perner,et al.  Data Mining - Concepts and Techniques , 2002, Künstliche Intell..

[11]  Andrew W. Moore,et al.  Internet traffic classification using bayesian analysis techniques , 2005, SIGMETRICS '05.

[12]  Makoto Takizawa,et al.  A Survey on Clustering Algorithms for Wireless Sensor Networks , 2010, 2010 13th International Conference on Network-Based Information Systems.

[13]  Sudipto Guha,et al.  CURE: an efficient clustering algorithm for large databases , 1998, SIGMOD '98.

[14]  Andrew W. Moore,et al.  Architecture of a network monitor , 2003 .

[15]  Jiong Yang,et al.  STING: A Statistical Information Grid Approach to Spatial Data Mining , 1997, VLDB.

[16]  Salvatore J. Stolfo,et al.  Cost-based modeling for fraud and intrusion detection: results from the JAM project , 2000, Proceedings DARPA Information Survivability Conference and Exposition. DISCEX'00.

[17]  Ali S. Hadi,et al.  Finding Groups in Data: An Introduction to Chster Analysis , 1991 .

[18]  Douglas H. Fisher,et al.  Knowledge Acquisition Via Incremental Conceptual Clustering , 1987, Machine Learning.

[19]  Marko Grobelnik,et al.  A SURVEY OF ONTOLOGY EVALUATION TECHNIQUES , 2005 .

[20]  Douglas Hayes Fisher,et al.  Knowledge acquisition via incremental conceptual clustering : a dussertation submitted in partial satisfaction of the requirements for the degree doctor of philosophy in information and computer science , 1987 .

[21]  Pat Langley,et al.  Models of Incremental Concept Formation , 1990, Artif. Intell..

[22]  Jiawei Han,et al.  CLARANS: A Method for Clustering Objects for Spatial Data Mining , 2002, IEEE Trans. Knowl. Data Eng..

[23]  Daniel A. Keim,et al.  Optimal Grid-Clustering: Towards Breaking the Curse of Dimensionality in High-Dimensional Clustering , 1999, VLDB.

[24]  Teuvo Kohonen,et al.  The self-organizing map , 1990, Neurocomputing.

[25]  P. Cochat,et al.  Et al , 2008, Archives de pediatrie : organe officiel de la Societe francaise de pediatrie.

[26]  J. MacQueen Some methods for classification and analysis of multivariate observations , 1967 .

[27]  Hans-Peter Kriegel,et al.  OPTICS: ordering points to identify the clustering structure , 1999, SIGMOD '99.

[28]  Sudipto Guha,et al.  ROCK: a robust clustering algorithm for categorical attributes , 1999, Proceedings 15th International Conference on Data Engineering (Cat. No.99CB36337).

[29]  Zahir Tari,et al.  SCADAVT-A framework for SCADA security testbed based on virtualization technology , 2013, 38th Annual IEEE Conference on Local Computer Networks.

[30]  Christopher Leckie,et al.  Echidna: Efficient Clustering of Hierarchical Data for Network Traffic Analysis , 2006, Networking.

[31]  Anil K. Jain,et al.  Algorithms for Clustering Data , 1988 .

[32]  Marimuthu Palaniswami,et al.  Labelled data collection for anomaly detection in wireless sensor networks , 2010, 2010 Sixth International Conference on Intelligent Sensors, Sensor Networks and Information Processing.

[33]  Christopher Leckie,et al.  An Efficient Clustering Scheme to Exploit Hierarchical Data in Network Traffic Analysis , 2008, IEEE Transactions on Knowledge and Data Engineering.

[34]  Rui Xu,et al.  Survey of clustering algorithms , 2005, IEEE Transactions on Neural Networks.

[35]  Hans-Peter Kriegel,et al.  A distribution-based clustering algorithm for mining in large spatial databases , 1998, Proceedings 14th International Conference on Data Engineering.

[36]  Aidong Zhang,et al.  WaveCluster: A Multi-Resolution Clustering Approach for Very Large Spatial Databases , 1998, VLDB.

[37]  Ian Witten,et al.  Data Mining , 2000 .

[38]  Daniel A. Keim,et al.  An Efficient Approach to Clustering in Large Multimedia Databases with Noise , 1998, KDD.

[39]  M. Cugmas,et al.  On comparing partitions , 2015 .

[40]  Hans-Peter Kriegel,et al.  A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise , 1996, KDD.

[41]  Joshua Zhexue Huang,et al.  A Fast Clustering Algorithm to Cluster Very Large Categorical Data Sets in Data Mining , 1997, DMKD.

[42]  Hae-Sang Park,et al.  A simple and fast algorithm for K-medoids clustering , 2009, Expert Syst. Appl..

[43]  Zahir Tari,et al.  Toward an efficient and scalable feature selection approach for internet traffic classification , 2013, Comput. Networks.

[44]  Marina Meila,et al.  An Experimental Comparison of Several Clustering and Initialization Methods , 1998, UAI.