Algorithmic Techniques for Processing Data Streams

We survey some algorithmic techniques for processing data streams. After covering the basic methods of sampling and sketching, we present more advanced procedures that build on them. In particular, we examine algorithmic schemes for similarity mining, the concept of group testing, and techniques for clustering and summarizing data streams.
