Incremental web crawling as a competitive game of learning automata

There is no doubt that the World Wide Web has lived up to its hype as the world’s central information highway over the past years. A growing number of versatile services keep finding their way onto the Web as information providers continue to embrace the possibilities that the Web offers. In particular, the possibility of producing dynamic content has been an accelerating factor and is the reason why we can now conveniently participate in online auctions or follow the latest development of our favorite stocks in near real time from our own living rooms. However, for automated data mining applications that deploy crawlers to continuously capture the information provided by this new breed of services, the highly dynamic nature of the content is not convenient at all. As a matter of fact, a completely new set of challenges emerges, for which traditional crawling strategies are shown to be sub-optimal. Accordingly, a new class of methods for crawling operations is clearly needed. Nonetheless, the problem area has so far been given limited attention in the literature. In this thesis we address the new problem area of monitoring highly dynamic data sources of different importance. We use the concept of an incremental web crawler as the basis for our novel approach, in which we treat the incremental crawling task as a continuous learning problem where the scheduling of monitoring tasks is combined with parameter estimation in an on-line manner. By mapping the problem to two variants of the so-called knapsack problem, we propose two solutions based on a machine learning technique known as learning automata. We show empirically that our proposed solutions continuously improve their performance through a learning process and that they are capable of operating in non-stationary environments. We also compare their performance with that of alternative algorithms; most notably, our schemes are shown to outperform the traditional uniform crawling scheme by up to 550% in certain situations.
