Scalable single linkage hierarchical clustering for big data

Personal computing technologies are everywhere; hence, there is an abundance of staggeringly large data sets: the Library of Congress has stored over 160 terabytes of web data, and it is estimated that Facebook alone logs nearly a petabyte of data per day. Thus, there is a pressing need for systems that can elucidate the similarity and dissimilarity among and between groups in these big data sets. Clustering is one way to find these groups. In this paper, we extend the scalable Visual Assessment of Tendency (sVAT) algorithm to return single-linkage partitions of big data sets. The sVAT algorithm is designed to provide visual evidence of the number of clusters in unloadable (big) data sets. The extension we describe enables sVAT to also efficiently return the data partition indicated by that visual evidence. The computational complexity and storage requirements of sVAT are (usually) significantly less than the O(n^2) requirement of the classic single-linkage hierarchical algorithm. We show that sVAT is a scalable instantiation of single-linkage clustering for data sets that contain c compact-separated clusters, where c ≪ n and n is the number of objects. For data sets that do not contain compact-separated clusters, we show that sVAT produces a good approximation of single-linkage partitions. Experimental results are presented for both synthetic and real data sets.
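
For orientation, the classic single-linkage baseline that the abstract compares against can be sketched as follows. This is not the sVAT extension itself. It is a minimal illustration of the standard fact that a single-linkage partition into c clusters can be obtained by building a minimum spanning tree (here with Prim's algorithm) and removing its c - 1 longest edges. The function names `prim_mst` and `single_linkage_labels`, the use of Euclidean distance, and the toy data are assumptions made for this example; the sketch evaluates all O(n^2) pairwise distances, which is the cost that motivates a scalable alternative.

```python
import math


def prim_mst(points):
    """Return the n-1 MST edges as (weight, i, j) tuples, using Prim's algorithm.

    Assumes Euclidean distance; evaluates O(n^2) pairwise distances.
    """
    n = len(points)
    in_tree = [False] * n
    best = [math.inf] * n   # cheapest known distance from the tree to each point
    parent = [-1] * n       # tree vertex that achieves that distance
    best[0] = 0.0
    edges = []
    for _ in range(n):
        # Pick the point outside the tree that is closest to the tree.
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if parent[u] >= 0:
            edges.append((best[u], parent[u], u))
        # Update the cheapest connection for every remaining point.
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best[v]:
                    best[v] = d
                    parent[v] = u
    return edges


def single_linkage_labels(points, c):
    """Cut the c-1 longest MST edges and label the connected components."""
    n = len(points)
    kept = sorted(prim_mst(points))[: n - c]  # keep the n-c shortest edges

    # Union-find over the kept edges to recover the components.
    root = list(range(n))

    def find(i):
        while root[i] != i:
            root[i] = root[root[i]]  # path halving
            i = root[i]
        return i

    for _, i, j in kept:
        root[find(i)] = find(j)

    # Relabel components 0, 1, 2, ... in order of first appearance.
    ids = {}
    return [ids.setdefault(find(i), len(ids)) for i in range(n)]


# Toy data (hypothetical): two compact, well-separated groups.
toy = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
print(single_linkage_labels(toy, 2))  # [0, 0, 0, 1, 1, 1]
```

On compact, well-separated data such as the toy example, the longest MST edges are exactly the ones bridging the groups, so cutting them recovers the intended partition.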
