Distributed clustering of ubiquitous data streams

Information is increasingly generated and gathered from distributed streaming data sources, which stresses communication and computing infrastructure and makes the data hard to transmit, process, and store. Knowledge discovery from ubiquitous data streams has become a major goal for a wide range of applications, and it relies mostly on unsupervised techniques such as clustering. Two subproblems exist: clustering streaming data observations and clustering streaming data sources. The former searches for dense regions of the data space, identifying hot spots where data sources tend to produce data, while the latter finds groups of sources that behave similarly over time. To assess the current status of this topic, this article presents a thorough review of distributed algorithms addressing either subproblem. We characterize clustering algorithms for ubiquitous data streams and discuss the advantages and disadvantages of distributed procedures. Overall, distributed stream clustering methods improve communication ratios, processing speed, and resource consumption, while achieving clustering validity similar to that of their centralized counterparts. WIREs Data Mining Knowl Discov 2014, 4:38–54. doi: 10.1002/widm.1109
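The difference between the two subproblems can be illustrated with a small sketch. It is a centralized, batch toy example and not any of the algorithms reviewed in the article; the data, the function names (`dense_cells`, `group_sources`), the grid width, and the thresholds are all assumptions made for illustration. Clustering observations is approximated by counting pooled values per grid cell, and clustering sources by linking sources whose series are highly correlated.

```python
from collections import Counter
from itertools import combinations
from math import sqrt

# Hypothetical toy readings: one time-aligned list of values per source.
streams = {
    "s1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "s2": [1.1, 2.1, 2.9, 4.2, 5.1],
    "s3": [5.0, 4.0, 3.0, 2.0, 1.0],
    "s4": [5.2, 3.9, 3.1, 2.1, 0.9],
}


def dense_cells(streams, width=1.0, min_count=4):
    """Clustering observations: pool all values from all sources and
    return the grid cells that contain at least min_count of them."""
    counts = Counter(
        int(value // width)
        for series in streams.values()
        for value in series
    )
    return sorted(cell for cell, n in counts.items() if n >= min_count)


def pearson(x, y):
    """Pearson correlation between two equal-length series."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def group_sources(streams, min_corr=0.9):
    """Clustering sources: link every pair of sources whose correlation
    is at least min_corr and return the resulting connected groups."""
    parent = {name: name for name in streams}

    def find(name):
        # Follow parent links up to the representative of the group.
        while parent[name] != name:
            name = parent[name]
        return name

    for a, b in combinations(streams, 2):
        if pearson(streams[a], streams[b]) >= min_corr:
            parent[find(a)] = find(b)

    groups = {}
    for name in streams:
        groups.setdefault(find(name), []).append(name)
    return sorted(sorted(group) for group in groups.values())


if __name__ == "__main__":
    print("dense cells:", dense_cells(streams))
    print("source groups:", group_sources(streams))
```

The first function answers "where in the data space do readings concentrate?" regardless of which source produced them, while the second answers "which sources evolve alike?" regardless of where their values lie. A distributed streaming method would have to maintain either answer incrementally and with limited communication, which this sketch does not attempt.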
