Power Law Distributions in Information Retrieval

Several properties of information retrieval (IR) data, such as query frequency or document length, are widely assumed to approximately follow a power law. This assumption is typically invoked to highlight specific characteristics of the empirical probability distribution of such data (e.g., its scale-free nature or its long/fat tail). However, it may not always hold. Motivated by recent work on the statistical treatment of power law claims, we investigate two research questions: (i) To what extent do power law approximations hold for term frequency, document length, query frequency, query length, citation frequency, and syntactic unigram frequency? And (ii) what is the computational cost of replacing ad hoc power law approximations with more accurate distribution fitting? We study 23 TREC and 5 non-TREC datasets and compare the fit of power laws to that of 15 other standard probability distributions. We find that query frequency and 5 out of 24 term frequency distributions are best approximated by a power law. All remaining properties are better approximated by the Inverse Gaussian, Generalized Extreme Value, Negative Binomial, or Yule distribution. We also find that the overhead of replacing power law approximations with more informed distribution fitting is negligible, with potential gains for IR tasks such as index compression or test collection generation for IR evaluation.
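To make the idea of comparing a power law against an alternative distribution concrete, the following is a minimal sketch and not the procedure used in the study. It assumes a continuous power law with a known lower bound `xmin`, uses a shifted exponential as the single competing model, and ranks the two by the Akaike information criterion (AIC). All function names are hypothetical and chosen for illustration only.

```python
import math
import random


def fit_power_law(data, xmin):
    """Maximum-likelihood fit of a continuous power law on x >= xmin.

    Density: p(x) = (alpha - 1) / xmin * (x / xmin) ** (-alpha).
    Returns (alpha_hat, log_likelihood). xmin is treated as given.
    """
    n = len(data)
    log_ratio_sum = sum(math.log(x / xmin) for x in data)
    alpha = 1.0 + n / log_ratio_sum
    loglik = n * math.log(alpha - 1.0) - n * math.log(xmin) - alpha * log_ratio_sum
    return alpha, loglik


def fit_shifted_exponential(data, xmin):
    """Maximum-likelihood fit of an exponential shifted to start at xmin.

    Density: p(x) = lam * exp(-lam * (x - xmin)) on x >= xmin.
    Returns (lam_hat, log_likelihood).
    """
    n = len(data)
    excess_sum = sum(x - xmin for x in data)
    lam = n / excess_sum
    loglik = n * math.log(lam) - lam * excess_sum
    return lam, loglik


def aic(loglik, num_params):
    """Akaike information criterion; lower values indicate a better trade-off."""
    return 2 * num_params - 2 * loglik


if __name__ == "__main__":
    # Synthetic heavy-tailed sample (an assumption for the demo, not IR data).
    rng = random.Random(0)
    sample = [rng.paretovariate(1.5) for _ in range(5000)]
    alpha_hat, ll_pl = fit_power_law(sample, xmin=1.0)
    lam_hat, ll_exp = fit_shifted_exponential(sample, xmin=1.0)
    print("power law alpha:", alpha_hat, "AIC:", aic(ll_pl, 1))
    print("exponential rate:", lam_hat, "AIC:", aic(ll_exp, 1))
```

Both models here have one free parameter and the same support, so the AIC difference reduces to a log-likelihood difference. A fuller comparison along the lines of the abstract would add further candidate distributions, estimate `xmin` rather than fix it, and apply a formal test for non-nested models.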
