The Ethicality of Web Crawlers

Search engines rely largely on web crawlers to collect information from the web, and crawlers alone now generate an enormous amount of web traffic. To limit the negative effects of this traffic on websites, an individual web server can regulate crawler behavior by implementing the Robots Exclusion Protocol in a file called “robots.txt”. Although not an official standard, the Robots Exclusion Protocol has been adopted to a greater or lesser extent by nearly all commercial search engines and popular crawlers. Because many website administrators and policy makers have come to rely on the informal contract set forth by the Robots Exclusion Protocol, the degree to which web crawlers respect robots.txt policies has become an important issue in computer ethics. In this research, we investigate and define rules to measure crawler ethics, that is, the extent to which web crawlers respect the regulations set forth in robots.txt configuration files. We test the ethical behavior of web crawlers by deploying a crawler honeypot: a set of websites, each configured with a distinct Robots Exclusion Protocol specification, in order to capture specific crawler behaviors. We propose a vector space model to represent crawler behavior and a set of models to measure the ethics of web crawlers based on that behavior. The results show that ethicality scores vary significantly among crawlers. Most commercial web crawlers receive fairly low ethicality violation scores, which means that most of their behavior is ethical; however, many commercial crawlers still consistently violate or misinterpret certain robots.txt rules.
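
To make the idea of measuring compliance concrete, the following minimal sketch counts how many of a crawler's requests fall on paths that a robots.txt file disallows. It is only an illustration and not the vector space model or the ethicality measures proposed in this work; the function name `violation_ratio`, the sample rules, the user agent `ExampleBot`, and the request log are all hypothetical. Rule matching is delegated to Python's standard `urllib.robotparser` module.

```python
from typing import Sequence
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt policy for one honeypot-style site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /tmp/
"""


def violation_ratio(robots_txt: str, user_agent: str,
                    requested_paths: Sequence[str]) -> float:
    """Return the fraction of requested paths that robots.txt disallows.

    This is an assumed, simplified score for illustration only.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    if not requested_paths:
        return 0.0
    violations = sum(
        1 for path in requested_paths
        if not parser.can_fetch(user_agent, path)
    )
    return violations / len(requested_paths)


# Hypothetical access log for a single crawler.
log = ["/index.html", "/private/data.html", "/tmp/cache", "/about.html"]
ratio = violation_ratio(ROBOTS_TXT, "ExampleBot", log)
print(f"violation ratio: {ratio:.2f}")
```

A ratio of zero would mean that every logged request respected the policy, while larger values indicate more frequent access to disallowed paths. A fuller treatment would also have to account for rule types beyond `Disallow`, such as crawl delays, which this sketch ignores.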
