Join-distinct aggregate estimation over update streams

There is growing interest in algorithms for processing and querying continuous data streams (i.e., data that is seen only once in a fixed order) with limited memory resources. Providing (perhaps approximate) answers to queries over such streams is a crucial requirement for many application environments; examples include large IP network installations where performance data from different parts of the network needs to be continuously collected and analyzed. The ability to estimate the number of distinct (sub)tuples in the result of a join operation correlating two data streams (i.e., the cardinality of a projection with duplicate elimination over a join) is an important requirement for several data-analysis scenarios. For instance, to enable real-time traffic analysis and load balancing, a network-monitoring application may need to estimate the number of distinct (source, destination) IP-address pairs occurring in the stream of IP packets observed by router R1, where the source address is also seen in packets routed through a different router R2. Earlier work has presented solutions to the individual problems of distinct counting and join-size estimation (without duplicate elimination) over streams. These solutions, however, are fundamentally different, and extending or combining them to handle our more complex "Join-Distinct" estimation problem is far from obvious. In this paper, we propose the first space-efficient algorithmic solution to the general Join-Distinct estimation problem over continuous data streams (our techniques can actually handle general update streams comprising tuple deletions as well as insertions). Our estimators are probabilistic in nature and rely on novel algorithms for building and combining a new class of hash-based synopses (termed "JD sketches") for individual update streams. We demonstrate that our algorithms can provide low-error, high-confidence Join-Distinct estimates using only small space and small processing time per update. In fact, we present lower bounds showing that the space usage of our estimators is within small factors of the best possible for the Join-Distinct problem. Preliminary experimental results verify the effectiveness of our approach.
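
To make the quantity being estimated concrete, the router example above can be written as an exact, linear-space baseline. This is only an illustrative sketch and not the JD-sketch estimator proposed in the paper: it stores every live tuple, which is precisely what a small-space synopsis is meant to avoid. The function names, the `(tuple, delta)` update encoding, and the sample addresses are assumptions introduced for illustration.

```python
from collections import Counter


def live_tuples(updates):
    """Fold an update stream of (tuple, delta) pairs into the set of tuples
    whose net frequency is positive. delta is +1 for an insertion and -1 for
    a deletion (a hypothetical encoding chosen for this sketch)."""
    freq = Counter()
    for tup, delta in updates:
        freq[tup] += delta
    return {tup for tup, count in freq.items() if count > 0}


def exact_join_distinct(r1_updates, r2_updates):
    """Count distinct (source, destination) pairs currently present at R1
    whose source also appears as a source in some live tuple at R2."""
    r1 = live_tuples(r1_updates)
    r2_sources = {src for src, _dst in live_tuples(r2_updates)}
    return len({(src, dst) for src, dst in r1 if src in r2_sources})


# Made-up streams: each element is ((source, destination), delta).
r1_stream = [
    (("a", "x"), +1),
    (("a", "x"), +1),   # duplicate packet: counted once after duplicate elimination
    (("a", "y"), +1),
    (("b", "z"), +1),
    (("c", "x"), +1),
    (("c", "x"), -1),   # deletion cancels the earlier insertion
]
r2_stream = [
    (("a", "q"), +1),
    (("c", "q"), +1),
    (("b", "q"), +1),
    (("b", "q"), -1),   # source "b" is no longer present at R2
]

result = exact_join_distinct(r1_stream, r2_stream)
print(result)  # 2: the pairs ("a", "x") and ("a", "y")
```

The baseline needs memory proportional to the number of distinct live tuples in both streams, which is the cost that motivates approximate, synopsis-based estimation.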