Paper information - Power Law Distributions in Information Retrieval

Several properties of information retrieval (IR) data, such as query frequency or document length, are widely assumed to be approximately power-law distributed. This assumption is typically invoked to highlight specific characteristics of the empirical probability distribution of such data (e.g., its scale-free nature or its long/fat tail). It may not, however, always hold. Motivated by recent work on the statistical treatment of power law claims, we investigate two research questions: (i) To what extent do power law approximations hold for term frequency, document length, query frequency, query length, citation frequency, and syntactic unigram frequency? (ii) What is the computational cost of replacing ad hoc power law approximations with more accurate distribution fitting? We study 23 TREC and 5 non-TREC datasets and compare the fit of power laws to that of 15 other standard probability distributions. We find that query frequency and 5 out of 24 term frequency distributions are best approximated by a power law. All remaining properties are better approximated by the Inverse Gaussian, Generalized Extreme Value, Negative Binomial, or Yule distribution. We also find the overhead of replacing power law approximations with more informed distribution fitting to be negligible, with potential gains for IR tasks such as index compression or test collection generation for IR evaluation.
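To make the idea of comparing a power law against an alternative distribution concrete, the following is a minimal sketch and not the authors' procedure. It assumes a continuous power law with a known lower cutoff `xmin`, uses the lognormal as a stand-in alternative (chosen only because its maximum-likelihood fit has a closed form; it is not one of the distributions named above), and ranks the two fits by AIC. The function names `fit_power_law`, `fit_lognormal`, and `compare_fits` are hypothetical, and the lognormal is not truncated at `xmin`, which is a simplification.

```python
import math
import random


def fit_power_law(xs, xmin):
    """MLE for a continuous power law, p(x) proportional to x^(-alpha), x >= xmin.

    xmin is treated as given (an assumption), not estimated from the data.
    Returns (alpha, log_likelihood).
    """
    n = len(xs)
    log_ratio_sum = sum(math.log(x / xmin) for x in xs)
    alpha = 1.0 + n / log_ratio_sum
    loglik = (n * math.log(alpha - 1.0)
              - n * math.log(xmin)
              - alpha * log_ratio_sum)
    return alpha, loglik


def fit_lognormal(xs):
    """Closed-form MLE for a lognormal. Returns (mu, sigma, log_likelihood)."""
    n = len(xs)
    logs = [math.log(x) for x in xs]
    mu = sum(logs) / n
    var = sum((v - mu) ** 2 for v in logs) / n
    # At the MLE the quadratic term of the log-likelihood sums to n / 2.
    loglik = -sum(logs) - 0.5 * n * math.log(2.0 * math.pi * var) - 0.5 * n
    return mu, math.sqrt(var), loglik


def aic(loglik, num_params):
    """Akaike information criterion; lower values indicate a better trade-off."""
    return 2.0 * num_params - 2.0 * loglik


def compare_fits(xs, xmin):
    """Fit both models to the observations >= xmin and return their AIC values."""
    tail = [x for x in xs if x >= xmin]
    _, ll_power = fit_power_law(tail, xmin)
    _, _, ll_lognormal = fit_lognormal(tail)
    return {
        "power_law": aic(ll_power, 1),      # one free parameter: alpha
        "lognormal": aic(ll_lognormal, 2),  # two free parameters: mu, sigma
    }


if __name__ == "__main__":
    random.seed(42)
    # Synthetic data only; random.paretovariate(a) has density exponent a + 1.
    sample = [random.paretovariate(1.5) for _ in range(2000)]
    print(compare_fits(sample, xmin=1.0))
```

In this sketch the model with the lower AIC is preferred. Both fits run in time linear in the number of observations, which is at least consistent with the abstract's claim that more informed fitting adds little overhead, though it says nothing about the cost of the distributions the study actually used.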
