BioCrawler: An intelligent crawler for the semantic web

Web crawling has become an important aspect of web search as the WWW keeps growing and search engines strive to index the most important and up-to-date content. Many experimental approaches exist, but few try to model the current behaviour of search engines, which crawl and refresh the sites they deem important much more frequently than others. BioCrawler mirrors this behaviour on the semantic web by applying the learning strategies of BioTope, earlier work on ecosystem simulation. It employs the principles of BioTope's intelligent agents on the semantic web: it learns which sites are rich in semantic content and which sites link to them, and adjusts its crawling habits accordingly. In the end, it learns to behave much like state-of-the-art search engine crawlers do. However, BioCrawler reaches that behaviour solely by exploiting on-page factors, rather than off-page factors such as the currently used link popularity.
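
The idea of learning from on-page signals and adapting revisit frequency can be illustrated with a minimal sketch. It is not BioCrawler's actual algorithm, which the abstract does not specify. The class `SiteScheduler`, its parameters (`learning_rate`, `referrer_share`, the interval bounds in hours), and the notion of a `semantic_fraction` reward between 0 and 1 are all assumptions made for illustration.

```python
from collections import defaultdict


class SiteScheduler:
    """Toy revisit scheduler driven only by on-page observations.

    Hypothetical illustration: names and update rules are assumptions,
    not taken from the BioCrawler paper.
    """

    def __init__(self, learning_rate=0.5, referrer_share=0.25,
                 min_interval=1.0, max_interval=24.0):
        self.learning_rate = learning_rate    # how fast scores follow new evidence
        self.referrer_share = referrer_share  # credit passed to linking sites
        self.min_interval = min_interval      # hours between visits at score 1
        self.max_interval = max_interval      # hours between visits at score 0
        self.score = defaultdict(float)       # site -> learned value in [0, 1]

    def _update(self, site, reward):
        # Exponential moving average towards the observed reward.
        old = self.score[site]
        self.score[site] = old + self.learning_rate * (reward - old)

    def observe(self, site, semantic_fraction, referrers=()):
        """Record one crawl of `site`.

        `semantic_fraction` is the assumed on-page reward: the share of
        fetched content that carried semantic markup, between 0 and 1.
        """
        self._update(site, semantic_fraction)
        credit = self.referrer_share * semantic_fraction
        for ref in referrers:
            # Sites that link to rich sites gain value, but never lose any.
            if credit > self.score[ref]:
                self._update(ref, credit)

    def revisit_interval(self, site):
        # Higher score means a shorter interval, i.e. more frequent crawling.
        span = self.max_interval - self.min_interval
        return self.max_interval - self.score[site] * span


scheduler = SiteScheduler()
scheduler.observe("rich.example", 1.0, referrers=["hub.example"])
scheduler.observe("poor.example", 0.0)
for name in ("rich.example", "hub.example", "poor.example"):
    print(name, scheduler.revisit_interval(name))
```

In this sketch the semantically rich site is revisited most often, the site linking to it somewhat more often than the default, and the poor site at the maximum interval, without any link-popularity computation.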
