The Evolution of Link-Attributes for Pages and Its Implications on Web Crawling

An incremental crawler needs to know how web pages evolve and how their change frequencies relate to link-attributes such as indegree. This paper proposes a model for incremental crawling and reports an experiment that tests the correlation between change frequency and link-attributes by monitoring the evolution of all the link-attributes of the web pages within one website. In particular, we look closely at one special kind of page, which we call Index-pages. From the experiment, we draw four conclusions: (1) Pages with larger indegrees, outdegrees, or PageRank values change more often, and each of these link-attributes approximately follows a power-law distribution. (2) The link-attributes of pages seldom change, even when the pages themselves change. (3) A small proportion of the pages link to most of the vertices in the web graph. (4) Index-pages link to a sizeable number of the new pages in a website. These conclusions can be used to greatly improve the performance of an incremental crawler, which is a key component of general search engines and web information stores.
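To make the term link-attributes concrete, the following is a minimal sketch, not the method used in the experiment, of how indegree, outdegree, and PageRank could be computed for a small link graph of one site, and how pages might then be ordered for revisiting. The function names `link_attributes` and `recrawl_order`, the damping factor of 0.85, the fixed iteration count, and the choice to rank by indegree alone are all illustrative assumptions.

```python
from collections import defaultdict


def link_attributes(graph, damping=0.85, iterations=50):
    """Return {page: (indegree, outdegree, pagerank)} for a link graph.

    `graph` maps each page to the pages it links to. The damping factor
    and iteration count are assumed defaults, not values from the paper.
    """
    # Collect every page, including ones that only appear as link targets.
    pages = set(graph)
    for targets in graph.values():
        pages.update(targets)
    pages = sorted(pages)
    n = len(pages)

    # Deduplicate outlinks so repeated links count once.
    outlinks = {p: sorted(set(graph.get(p, ()))) for p in pages}

    indegree = defaultdict(int)
    for p in pages:
        for target in outlinks[p]:
            indegree[target] += 1

    # Power iteration; pages without outlinks spread their rank uniformly.
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        dangling = sum(rank[p] for p in pages if not outlinks[p])
        new_rank = {p: (1 - damping) / n + damping * dangling / n for p in pages}
        for p in pages:
            if outlinks[p]:
                share = damping * rank[p] / len(outlinks[p])
                for target in outlinks[p]:
                    new_rank[target] += share
        rank = new_rank

    return {p: (indegree[p], len(outlinks[p]), rank[p]) for p in pages}


def recrawl_order(attrs):
    """Order pages by descending indegree, breaking ties by name.

    Using indegree as the revisit priority is an assumption motivated by
    conclusion (1); a real crawler would likely combine several signals.
    """
    return sorted(attrs, key=lambda p: (-attrs[p][0], p))
```

Because conclusion (2) suggests that link-attributes are fairly stable over time, a crawler built along these lines could plausibly recompute them only occasionally and reuse the resulting order across several crawl cycles.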