Modeling Skewed Distribution Using Multifractals and the '80-20' Law

This paper focuses on characterizing the skewness of an attribute-value distribution and on extrapolating parameters of interest from it. More specifically, given a vector of the $h$ highest multiplicities $\vec{m} = (m_1, m_2, \ldots, m_h)$ and some frequency moments $F_q = \sum_i m_i^q$ (e.g., $q = 0, 2$), we provide effective schemes for estimating either the statistics of the relation or those of its subsets and supersets. We assume an 80/20 law, and specifically a $p/(1-p)$ law. This law yields a distribution commonly known in the fractals literature as 'multifractal'. We show how to estimate $p$ from the given information (the first few multiplicities and a few moments), and present the results of our experiments on real data. Our results demonstrate that schemes based on our multifractal assumption consistently outperform schemes based on the uniformity assumption, which are commonly used in current DBMSs. Moreover, our schemes can provide estimates for supersets of a relation, which uniformity-based schemes cannot provide at all.
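As an illustration of the $p/(1-p)$ law, the sketch below builds an idealized binomial multiplicative cascade and recovers $p$ from the moments $F_0$, $F_1$, and $F_2$. It is not the paper's estimation scheme, which the abstract does not spell out. It assumes that the number of distinct values is exactly $2^k$ and that the multiplicities follow the cascade exactly, in which case $F_q = F_1^q \,(p^q + (1-p)^q)^k$. The function names are hypothetical.

```python
import math


def multifractal_multiplicities(total, p, levels):
    """Expected multiplicities of an idealized p/(1-p) cascade.

    Assumption: the domain has 2**levels distinct values, and at every
    level a fraction p of the occurrences goes to one half of the domain
    and 1-p to the other half. Returned in descending order.
    """
    counts = [float(total)]
    for _ in range(levels):
        counts = [c * w for c in counts for w in (p, 1.0 - p)]
    return sorted(counts, reverse=True)


def frequency_moment(multiplicities, q):
    """F_q = sum of m_i**q over the values that actually occur (m_i > 0)."""
    return sum(m ** q for m in multiplicities if m > 0)


def estimate_p(f0, f1, f2):
    """Recover p from F_0, F_1, F_2 under the idealized cascade.

    With k = log2(F_0), the cascade gives F_2 / F_1**2 = (p**2 + (1-p)**2)**k.
    Solving 2p**2 - 2p + 1 = s for the root with p >= 0.5 yields p.
    """
    levels = math.log2(f0)
    s = (f2 / f1 ** 2) ** (1.0 / levels)
    return 0.5 * (1.0 + math.sqrt(max(0.0, 2.0 * s - 1.0)))


if __name__ == "__main__":
    m = multifractal_multiplicities(total=1000, p=0.8, levels=5)
    f0, f1, f2 = (frequency_moment(m, q) for q in (0, 1, 2))
    print("top multiplicities:", [round(x, 1) for x in m[:3]])
    print("estimated p:", round(estimate_p(f0, f1, f2), 6))
```

On real data the multiplicities only approximate a cascade and the number of distinct values is rarely a power of two, so an estimate obtained this way would be a rough starting point rather than a fitted value.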
