An Architectural Framework of a Crawler for Locating Deep Web Repositories Using Learning Multi-agent Systems

The World Wide Web (WWW) has become one of the largest and most readily accessible repositories of human knowledge. Traditional search engines index only the surface Web, whose pages are easily found. Attention has now shifted to the invisible Web, or hidden Web, which consists of large warehouses of useful data such as images, sounds, presentations, and many other types of media. To make use of such data, a specialized program is needed to locate these sites, much as search engines locate surface Web pages. This paper presents an effective design of a hidden Web crawler that autonomously discovers pages from the hidden Web by employing a multi-agent Web mining system. A theoretical framework is proposed for investigating the resource discovery problem, and the empirical results suggest substantial improvement in the crawling strategy and harvest rate.