The "DGX" distribution for mining massive, skewed data

Skewed distributions appear very often in practice. Unfortunately, the traditional Zipf distribution often fails to model them well. In this paper, we propose a new probability distribution, the Discrete Gaussian Exponential (DGX), which achieves excellent fits in a wide variety of settings and includes the Zipf distribution as a special case. We present a statistically sound method for estimating the DGX parameters based on maximum likelihood estimation (MLE). We applied DGX to a wide variety of real-world data sets, such as sales data from a large retailer chain, usage data from AT&T, and Internet clickstream data; in all cases, DGX fits these distributions very well, with a correlation coefficient of almost 99% in quantile-quantile plots. Our algorithm also scales very well because it requires only a single pass over the data. Finally, we illustrate the power of DGX as a new tool for data mining tasks, such as outlier detection.
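The abstract does not state the DGX formula, so the following is only a minimal sketch under an assumption: that DGX is a lognormal discretized to the positive integers, with P(x = k) proportional to (1/k) exp(-(ln k - mu)^2 / (2 sigma^2)) for k = 1, 2, .... The function names, the truncation point `k_max`, the toy data, and the parameter values are all hypothetical and chosen for illustration. The sketch builds a histogram of the data in one pass and evaluates the negative log-likelihood that an MLE procedure would minimize; it does not reproduce the estimation method of the paper.

```python
import math
from collections import Counter

def dgx_pmf(mu, sigma, k_max):
    """Assumed DGX form: weight(k) = (1/k) * exp(-(ln k - mu)^2 / (2 sigma^2)).

    Returns probabilities for k = 1..k_max, normalized over this truncated
    support (the truncation is an assumption made to keep the sum finite).
    """
    log_w = [-math.log(k) - (math.log(k) - mu) ** 2 / (2.0 * sigma ** 2)
             for k in range(1, k_max + 1)]
    shift = max(log_w)                      # subtract the max for numerical stability
    w = [math.exp(v - shift) for v in log_w]
    total = sum(w)
    return [v / total for v in w]

def neg_log_likelihood(mu, sigma, counts, k_max):
    """Negative log-likelihood of a histogram {value: frequency} under the assumed pmf."""
    pmf = dgx_pmf(mu, sigma, k_max)
    return -sum(n * math.log(pmf[k - 1]) for k, n in counts.items())

# Hypothetical toy data; a single pass builds the histogram.
data = [1, 1, 1, 2, 2, 3, 5, 8, 40]
counts = Counter(data)
k_max = 1000

pmf = dgx_pmf(1.0, 1.5, k_max)
nll_plausible = neg_log_likelihood(1.0, 1.5, counts, k_max)
nll_poor = neg_log_likelihood(6.0, 0.5, counts, k_max)
```

An optimizer over `mu` and `sigma` would search for the pair that minimizes `neg_log_likelihood`; here the two evaluations only show that parameters far from the data give a much larger value.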
