Power-Law Distributions in Empirical Data

Power-law distributions occur in many situations of scientific interest and have significant consequences for our understanding of natural and man-made phenomena. Unfortunately, the detection and characterization of power laws is complicated by the large fluctuations that occur in the tail of the distribution—the part of the distribution representing large but rare events—and by the difficulty of identifying the range over which power-law behavior holds. Commonly used methods for analyzing power-law data, such as least-squares fitting, can produce substantially inaccurate estimates of parameters for power-law distributions, and even in cases where such methods return accurate answers they are still unsatisfactory because they give no indication of whether the data obey a power law at all. Here we present a principled statistical framework for discerning and quantifying power-law behavior in empirical data. Our approach combines maximum-likelihood fitting methods with goodness-of-fit tests based on the Kolmogorov-Smirnov (KS) statistic and likelihood ratios. We evaluate the effectiveness of the approach with tests on synthetic data and give critical comparisons to previous approaches. We also apply the proposed methods to twenty-four real-world data sets from a range of different disciplines, each of which has been conjectured to follow a power-law distribution. In some cases we find these conjectures to be consistent with the data, while in others the power law is ruled out.
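The two core ingredients named above, maximum-likelihood fitting and the KS statistic, can be illustrated with a minimal sketch. It covers only the continuous case and assumes the lower cutoff `xmin` is already known, whereas the abstract points out that identifying the range of power-law behavior is itself part of the problem. The function names (`fit_alpha`, `ks_distance`, `sample_power_law`) are hypothetical and are not taken from the authors' software. The sketch does not compute a goodness-of-fit p-value or a likelihood ratio against alternative distributions.

```python
import math
import random


def fit_alpha(data, xmin):
    """Maximum-likelihood exponent for a continuous power law
    p(x) proportional to x**(-alpha) on x >= xmin (xmin assumed known)."""
    tail = [x for x in data if x >= xmin]
    n = len(tail)
    log_sum = sum(math.log(x / xmin) for x in tail)
    if n == 0 or log_sum == 0.0:
        raise ValueError("need tail observations that are not all equal to xmin")
    alpha = 1.0 + n / log_sum
    # Leading-order asymptotic standard error of the estimate.
    sigma = (alpha - 1.0) / math.sqrt(n)
    return alpha, sigma, n


def ks_distance(data, xmin, alpha):
    """KS statistic: largest gap between the empirical CDF of the tail
    and the fitted power-law CDF, 1 - (x / xmin)**(1 - alpha)."""
    tail = sorted(x for x in data if x >= xmin)
    n = len(tail)
    d = 0.0
    for i, x in enumerate(tail):
        model_cdf = 1.0 - (x / xmin) ** (1.0 - alpha)
        # The empirical CDF jumps from i/n to (i+1)/n at x; check both sides.
        d = max(d, abs(model_cdf - i / n), abs(model_cdf - (i + 1) / n))
    return d


def sample_power_law(n, xmin, alpha, rng):
    """Draw n synthetic values by inverse-transform sampling."""
    return [xmin * (1.0 - rng.random()) ** (-1.0 / (alpha - 1.0))
            for _ in range(n)]


if __name__ == "__main__":
    rng = random.Random(42)
    synthetic = sample_power_law(20000, xmin=1.0, alpha=2.5, rng=rng)
    alpha_hat, sigma, n_tail = fit_alpha(synthetic, xmin=1.0)
    print("alpha estimate:", alpha_hat, "+/-", sigma, "from", n_tail, "points")
    print("KS distance:", ks_distance(synthetic, 1.0, alpha_hat))
```

Fitting synthetic data drawn from a known power law, as in the usage lines at the bottom, mirrors the kind of synthetic-data check the abstract describes. A small KS distance alone does not establish that a power law is a good description of real data.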
