A SURVEY OF STREAM DATA MINING

Abstract – A growing number of applications generate massive streams of data that require intelligent processing and online analysis. Examples include real-time surveillance systems, telecommunication systems, sensor networks and other dynamic environments. The pressing need to turn such data into useful information and knowledge drives the development of systems, algorithms and frameworks that address streaming challenges. Storing, querying and mining such data sets are computationally demanding tasks. Mining data streams is concerned with extracting knowledge structures, represented as models and patterns, from continuous streams of information. In this paper, we present the theoretical foundations of data stream analysis, critically review techniques for mining data streams, and identify potential directions for future research.

Index terms – data streams, data mining, review

1. INTRODUCTION

Recently a new class of emerging applications has become widely recognized: applications in which data is generated at very high rates in the form of transient
