S-RASTER: contraction clustering for evolving data streams

Contraction Clustering (RASTER) is a single-pass algorithm for density-based clustering of 2D data. It processes arbitrary amounts of data in linear time and constant memory, quickly identifying approximate clusters, and it scales well across multiple CPU cores. RASTER is very competitive in speed with standard clustering algorithms, at the cost of reduced precision. However, RASTER is limited to batch processing and cannot identify clusters that exist only temporarily. S-RASTER adapts RASTER to the stream processing paradigm and can identify clusters in evolving data streams. It retains the main benefits of its parent algorithm, i.e. single-pass linear time cost and constant memory requirements for each discrete time step within a sliding window. The sliding window is pruned efficiently, and clustering is still performed in linear time. Like RASTER, S-RASTER trades an often negligible amount of precision for speed. Our evaluation shows that competing algorithms are at least 50% slower. Furthermore, S-RASTER achieves good clustering quality as measured by standard metrics. It is particularly well suited to real-world scenarios where clustering happens periodically rather than continually.
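The abstract does not spell out the individual steps, so the following Python sketch only illustrates the general idea of grid-based density clustering over a sliding window; it is not the authors' implementation. It assumes that 2D points are projected onto grid tiles by truncating their coordinates, that a tile counts as significant once it holds at least `tau` points within the window, and that adjacent significant tiles are joined into clusters of at least `mu` tiles. The class name `SlidingWindowGridClusterer` and the parameters `precision`, `tau`, `mu`, and `window` are hypothetical names chosen for this illustration, and time periods are assumed to arrive in non-decreasing order.

```python
import math
from collections import defaultdict, deque


class SlidingWindowGridClusterer:
    """Illustrative sketch only; all names and parameters are assumptions."""

    def __init__(self, precision=1, tau=2, mu=2, window=3):
        self.scale = 10 ** precision      # tile size is 10**-precision
        self.tau = tau                    # min points for a significant tile
        self.mu = mu                      # min tiles per reported cluster
        self.window = window              # number of periods kept
        self.totals = defaultdict(int)    # tile -> point count in the window
        self.periods = deque()            # (period, {tile: count}) per period

    def _tile(self, x, y):
        # Projection: truncate coordinates to the chosen precision.
        return (math.floor(x * self.scale), math.floor(y * self.scale))

    def add(self, period, x, y):
        # Each point is touched once; only per-tile counts are stored.
        if not self.periods or self.periods[-1][0] != period:
            self.periods.append((period, defaultdict(int)))
            self._prune(period)
        tile = self._tile(x, y)
        self.periods[-1][1][tile] += 1
        self.totals[tile] += 1

    def _prune(self, current):
        # Drop periods that have fallen out of the sliding window.
        while self.periods and self.periods[0][0] <= current - self.window:
            _, counts = self.periods.popleft()
            for tile, n in counts.items():
                self.totals[tile] -= n
                if self.totals[tile] == 0:
                    del self.totals[tile]

    def clusters(self):
        # Join adjacent significant tiles (8-neighbourhood) into clusters.
        significant = {t for t, n in self.totals.items() if n >= self.tau}
        result = []
        while significant:
            start = significant.pop()
            cluster, stack = {start}, [start]
            while stack:
                cx, cy = stack.pop()
                for dx in (-1, 0, 1):
                    for dy in (-1, 0, 1):
                        nb = (cx + dx, cy + dy)
                        if nb in significant:
                            significant.remove(nb)
                            cluster.add(nb)
                            stack.append(nb)
            if len(cluster) >= self.mu:
                result.append(cluster)
        return result
```

In this sketch, `add` is called once per incoming point and `clusters` only when a clustering is requested, which mirrors the periodic-clustering scenario mentioned above. Memory grows with the number of occupied tiles in the window rather than with the number of points seen.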
