Sample size and informetric model goodness-of-fit outcomes: a search engine log case study

The influence of sample size on informetric characteristics is examined to determine whether theoretical mathematical models can adequately fit large data sets. Two large data sets of queries submitted to the Excite search service were sampled for four search characteristics (term frequencies, terms used per query, pages viewed per query, and queries submitted per session), producing data sets of various sizes that were fitted to theoretical models to determine how sample size may influence a model’s goodness-of-fit. Although theoretical models could adequately fit smaller data sets of up to 5000 observations in some cases, larger data sets could not be satisfactorily fitted, as judged by several goodness-of-fit techniques. Investigators must therefore take into account that sample size influences goodness-of-fit outcomes. It is the nature of the data, and not the limitations of particular goodness-of-fit tests, that produces the significant outcomes. Such goodness-of-fit tests should be used for comparative purposes rather than for significance testing.
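The sample-size effect described above can be illustrated with a minimal sketch. It is not the procedure used in the study: the three-category model, the slightly different "observed" proportions, and the function names are hypothetical, and the counts are idealized (exactly proportional to n) rather than drawn from the Excite logs. Under these assumptions, the Pearson chi-square statistic grows in proportion to n, so a deviation that passes at 5000 observations is rejected at 500,000.

```python
# Hypothetical model and data-generating proportions (illustration only).
MODEL_PROBS = [0.50, 0.30, 0.20]      # probabilities predicted by the fitted model
DATA_PROPS = [0.51, 0.295, 0.195]     # slightly different proportions in the data

# Chi-square critical value for 2 degrees of freedom at the 0.05 level.
CRITICAL_DF2_05 = 5.991


def pearson_chi_square(observed, expected_probs):
    """Pearson statistic: sum of (O - E)^2 / E over all categories."""
    n = sum(observed)
    return sum((o - n * p) ** 2 / (n * p)
               for o, p in zip(observed, expected_probs))


def chi_square_at(n):
    """Statistic for an idealized sample of size n (counts exactly proportional)."""
    observed = [n * q for q in DATA_PROPS]
    return pearson_chi_square(observed, MODEL_PROBS)


if __name__ == "__main__":
    for n in (500, 5000, 500000):
        stat = chi_square_at(n)
        verdict = "rejected" if stat > CRITICAL_DF2_05 else "not rejected"
        print(f"n={n:>7}  chi-square={stat:8.3f}  model {verdict} at 0.05")
```

Because the relative deviation is held fixed, the statistic for 500,000 observations is 100 times the statistic for 5000. This is one reason to compare statistics across candidate models at a common sample size rather than to read each as a significance test.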
