On clustering large number of data streams

Data streams and their applications appear in many fields, including physics, finance, medicine, and environmental science. As sensor technology improves, sensor data rates continue to increase, and analyzing data streams becomes ever more challenging. Fast online response is essential for applications that involve multiple data streams, especially when the number of streams is large. This paper proposes an efficient clustering technique, the Multi-way Grid-based join algorithm (MG-join), to find clusters in multiple data streams. The proposed algorithm uses the Discrete Fourier Transform (DFT) to reduce the dimensionality of the streams. Each stream is represented by a point in a multi-dimensional grid in the frequency domain, and MG-join finds the clusters among the streams in that domain. Moreover, this paper proposes an incremental update mechanism that avoids recalculating the DFT coefficients when new readings arrive and thus reduces the processing time. Experiments on synthetic data streams show that the proposed clustering technique is much faster than traditional clustering techniques while matching their accuracy. This makes the proposed technique suitable for sensor network environments, where computing and power capabilities are limited.
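The abstract does not spell out the incremental update, so the following is only a minimal sketch of one common way to maintain DFT coefficients over a sliding window: the standard sliding-DFT recurrence, which may differ from the mechanism used by MG-join. For a window of length $w$ with unnormalized coefficients $X_k$, dropping the oldest reading $x_{\text{old}}$ and appending $x_{\text{new}}$ gives $X_k' = (X_k - x_{\text{old}} + x_{\text{new}})\, e^{2\pi i k / w}$. The names `SlidingDFT`, `dft_prefix`, and `grid_cell`, the choice of three coefficients, and the cell width are all assumptions made for illustration.

```python
import cmath
import math
from collections import deque


def dft_prefix(window, num_coeffs):
    """Directly compute the first `num_coeffs` unnormalized DFT coefficients."""
    w = len(window)
    return [
        sum(x * cmath.exp(-2j * math.pi * k * n / w) for n, x in enumerate(window))
        for k in range(num_coeffs)
    ]


class SlidingDFT:
    """Hypothetical helper: keeps a few DFT coefficients of a sliding window."""

    def __init__(self, initial_window, num_coeffs):
        self.window = deque(initial_window)
        self.w = len(initial_window)
        self.coeffs = dft_prefix(initial_window, num_coeffs)
        # Per-coefficient rotation factors exp(2*pi*i*k/w), computed once.
        self.twiddles = [
            cmath.exp(2j * math.pi * k / self.w) for k in range(num_coeffs)
        ]

    def push(self, new_value):
        """Slide the window by one reading in O(num_coeffs) time."""
        old_value = self.window.popleft()
        self.window.append(new_value)
        delta = new_value - old_value
        self.coeffs = [(c + delta) * t for c, t in zip(self.coeffs, self.twiddles)]
        return self.coeffs


def grid_cell(coeffs, cell_width):
    """Assumed mapping: real and imaginary parts become integer grid indices."""
    features = []
    for c in coeffs:
        features.extend((c.real, c.imag))
    return tuple(math.floor(f / cell_width) for f in features)


# Usage: slide over a synthetic stream and compare with a full recomputation.
stream = [math.sin(0.3 * t) for t in range(40)]
sdft = SlidingDFT(stream[:16], num_coeffs=3)
for value in stream[16:]:
    sdft.push(value)
direct = dft_prefix(stream[-16:], 3)
max_error = max(abs(a - b) for a, b in zip(sdft.coeffs, direct))
cell = grid_cell(sdft.coeffs, cell_width=0.5)
```

Each update costs time proportional to the number of retained coefficients rather than to the window length, which is the kind of saving the abstract attributes to incremental updating. Under this assumed mapping, streams whose coefficients fall into the same or neighboring grid cells would be candidates for the same cluster; the actual join and clustering rule is defined in the paper, not here.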
