Paper information - A large-scale study of robots.txt

Search engines largely rely on Web robots to collect information from the Web. Because the Web is open-access and unregulated, robot activities are extremely diverse. Such crawling activities can be regulated from the server side by deploying the Robots Exclusion Protocol in a file called robots.txt. Although it is not an enforced standard, ethical robots (and many commercial ones) will follow the rules specified in robots.txt. With our focused crawler, we investigate 7,593 websites from the education, government, news, and business domains. Five crawls have been conducted in succession to study temporal changes. Through statistical analysis of the data, we present a survey of the usage of Web robot rules at Web scale. The results also show that the usage of robots.txt has increased over time.
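
As a rough illustration of how a cooperating robot might consult robots.txt before fetching a page, the sketch below uses Python's standard `urllib.robotparser` module. The sample rules, the robot names (`GenericCrawler`, `ExampleBot`), and the `example.org` URLs are hypothetical and are not taken from the study; the code is not the authors' crawler.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, invented for illustration only.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/

User-agent: ExampleBot
Disallow: /
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a rule set that can be queried."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(SAMPLE_ROBOTS_TXT)

# A robot with no dedicated record falls back to the "*" rules.
allowed_public = parser.can_fetch("GenericCrawler", "https://example.org/index.html")
allowed_private = parser.can_fetch("GenericCrawler", "https://example.org/private/data.html")

# A robot named in its own record is governed by that record instead.
allowed_named = parser.can_fetch("ExampleBot", "https://example.org/index.html")

print(allowed_public, allowed_private, allowed_named)
```

Compliance is voluntary: nothing in the protocol stops a robot from skipping this check, which is why the abstract distinguishes ethical robots from the rest.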

C. Lee Giles | Yang Sun | Ziming Zhuang

[1] Filippo Menczer et al. Crawling the Web, 2004, Web Dynamics.

[2] Brian Kelly et al. Webwatching UK Web Communities: Final Report For The WebWatch Project, 1999.

[3] M. Carl Drott. Indexing aids at corporate websites: the use of robots.txt and META tags, 2002, Inf. Process. Manag.