Analysis and Management of Streaming Data: A Survey

Streaming data has recently become an active research topic in the database community worldwide. Over the past three decades, conventional database technologies have matured and been widely applied. However, they cannot be readily adapted to handle a new kind of data, called streaming data, which is generated by applications such as network routing, sensor networks, and stock analysis. Because data in the stream model arrives rapidly and in huge volumes, novel algorithms that need to see the whole data set only once have been devised to support aggregation queries on demand. Such algorithms usually maintain a data structure, called a synopsis, that is far smaller than the whole data set. This survey introduces ways to devise such synopsis data structures. These approaches are also compared by listing historical works upon two
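
As a rough illustration of the one-pass, small-synopsis idea described in the abstract, the sketch below uses reservoir sampling, a classic technique that keeps a uniform random sample of fixed size k while seeing each stream item exactly once. It is offered only as a minimal example under stated assumptions, not as the survey's own method; the function name `reservoir_sample` and its parameters are hypothetical.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(stream: Iterable[T], k: int, seed: Optional[int] = None) -> List[T]:
    """Return a uniform random sample of up to k items from a stream.

    Hypothetical helper for illustration: it reads each item once and
    stores at most k items, regardless of how long the stream is.
    """
    rng = random.Random(seed)
    reservoir: List[T] = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Pick a slot in [0, i]; the new item is kept with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir


# Usage: sample 10 values from a stream of one million integers.
sample = reservoir_sample(iter(range(1_000_000)), k=10, seed=42)
print(len(sample))
```

The synopsis here is the reservoir itself: its size depends only on k, so memory use stays constant as the stream grows, and an aggregate such as a mean can be estimated from the sample on demand.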
