Modelling the characteristics of Web page outlinks

Using data sampled from top-level Web pages across five high-level domains and from sample pages within individual websites, the authors investigate the frequency distribution of outlinks in Web pages. The observed distributions were fitted to several theoretical distributions to determine which model best represents outlink frequency across Web pages. The theoretical models tested were the modified power law (MPL), Mandelbrot (MDB), generalized Waring (GW), generalized inverse Gaussian-Poisson (GIGP), and generalized negative binomial (GNB) distributions. The GIGP and GNB provided good fits to the data sets for top-level pages across the high-level domains tested, with the GIGP performing slightly better. For two of the observed outlink distributions from Web pages within a given website, the lumpiness and bimodal nature of the data resulted in poor fits for all of the theoretical models. The GIGP provided better fits to these data sets after the top components were truncated. The ability to model Web page attributes effectively, such as the distribution of the number of outlinks per page, paves the way for simulation models of Web page structural content and makes it possible to estimate the number of outlinks that may be encountered within Web pages of a specific domain or within individual websites.
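
As a rough illustration of the kind of fitting procedure described above, the sketch below fits a plain truncated discrete power law to a frequency table of outlink counts by grid-search maximum likelihood and then computes a Pearson chi-square statistic. This is a simplified stand-in and not the authors' method: it is none of the MPL, MDB, GW, GIGP, or GNB models, and the `observed` counts, the function names, and the grid bounds are all hypothetical.

```python
import math

def power_law_pmf(alpha, k_max):
    """P(k) proportional to k**(-alpha) for k = 1..k_max (truncated support)."""
    weights = [k ** (-alpha) for k in range(1, k_max + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def log_likelihood(freq, alpha):
    """freq[i] is the number of pages with i + 1 outlinks."""
    pmf = power_law_pmf(alpha, len(freq))
    return sum(n * math.log(p) for n, p in zip(freq, pmf))

def fit_alpha(freq, lo=0.5, hi=4.0, steps=351):
    """Grid-search maximum-likelihood estimate of the exponent (assumed bounds)."""
    grid = [lo + i * (hi - lo) / (steps - 1) for i in range(steps)]
    return max(grid, key=lambda a: log_likelihood(freq, a))

def chi_square(freq, alpha):
    """Pearson chi-square statistic between observed and expected counts."""
    n_pages = sum(freq)
    pmf = power_law_pmf(alpha, len(freq))
    return sum((n - n_pages * p) ** 2 / (n_pages * p) for n, p in zip(freq, pmf))

# Hypothetical data: number of pages with 1, 2, ..., 8 outlinks.
observed = [120, 45, 22, 14, 9, 6, 5, 3]
alpha_hat = fit_alpha(observed)
print(f"alpha = {alpha_hat:.2f}, chi-square = {chi_square(observed, alpha_hat):.2f}")
```

Comparing candidate models would then amount to repeating the fit with each model's probability mass function and comparing the resulting goodness-of-fit statistics on the same frequency table.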
