On Power-Law Distributed Balls in Bins and Its Applications to View Size Estimation

View size estimation plays an important role in query optimization, and many data sets have been observed to follow a power law distribution. In this paper, we consider the balls in bins problem in which balls are placed into N bins and the bin selection probabilities follow a power law distribution. Generalizing the coupon collector's problem, we determine the expected number of balls that must be thrown so that each of the N bins contains at least one ball. We prove that $\Theta(\frac{N^\alpha \ln N}{c_N^{\alpha}})$ balls are needed, where α is the parameter of the power law distribution, $c_N^{\alpha}=\frac{\alpha-1}{\alpha-N^{1-\alpha}}$ for α≠1, and $c_N^{\alpha}=\frac{1}{\ln N}$ for α=1. Next, fixing the number of balls thrown to T, we provide closed-form upper and lower bounds on the expected number of bins that have at least one occupant. For large n and α>1, we prove that our bounds are tight up to a constant factor of $\left(\frac{\alpha}{\alpha-1}\right)^{1-\frac{1}{\alpha}} \leq e^{1/e} \simeq 1.4$.
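The two quantities discussed in the abstract can be explored numerically with a small simulation. The sketch below is illustrative only and is not the paper's method: it assumes bin i (1-indexed) is selected with probability proportional to $i^{-\alpha}$, which is the usual power-law form but is not spelled out in the abstract, and all function names are hypothetical. The exact expectation in `expected_occupied` follows from linearity of expectation, $\sum_i \left(1-(1-p_i)^T\right)$, not from the paper's closed-form bounds.

```python
import random
from itertools import accumulate


def power_law_weights(n_bins, alpha):
    """Unnormalized weights; ASSUMPTION: bin i (1-indexed) has weight i**(-alpha)."""
    return [i ** (-alpha) for i in range(1, n_bins + 1)]


def balls_until_all_occupied(n_bins, alpha, rng):
    """Throw balls one at a time until every bin holds at least one; return the count."""
    cum = list(accumulate(power_law_weights(n_bins, alpha)))
    bins = range(n_bins)
    occupied = set()
    thrown = 0
    while len(occupied) < n_bins:
        occupied.add(rng.choices(bins, cum_weights=cum)[0])
        thrown += 1
    return thrown


def occupied_bins(n_bins, alpha, n_balls, rng):
    """Throw exactly n_balls balls; return how many distinct bins were hit."""
    if n_balls == 0:
        return 0
    cum = list(accumulate(power_law_weights(n_bins, alpha)))
    return len(set(rng.choices(range(n_bins), cum_weights=cum, k=n_balls)))


def expected_occupied(n_bins, alpha, n_balls):
    """Exact expected number of occupied bins: sum over bins of 1 - (1 - p_i)**T."""
    weights = power_law_weights(n_bins, alpha)
    total = sum(weights)
    return sum(1.0 - (1.0 - w / total) ** n_balls for w in weights)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so runs are repeatable
    n_bins, alpha, n_balls, trials = 20, 1.5, 50, 200
    avg_cover = sum(balls_until_all_occupied(n_bins, alpha, rng)
                    for _ in range(trials)) / trials
    avg_occ = sum(occupied_bins(n_bins, alpha, n_balls, rng)
                  for _ in range(trials)) / trials
    print("average balls to cover all bins:", avg_cover)
    print("average occupied bins after T balls:", avg_occ)
    print("exact expected occupied bins:", expected_occupied(n_bins, alpha, n_balls))
```

Averaging `occupied_bins` over many trials should approach `expected_occupied`, and the average cover time grows quickly with α because the least likely bin has probability on the order of $N^{-\alpha}$.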
