Distributed and collaborative Web Change Detection system

Search engines use crawlers to traverse the Web, downloading web pages to build their indexes. Keeping these indexes up to date is essential to the quality of search results, but changes to web pages are unpredictable, and identifying when a page changes as soon as possible and at minimal computational cost is a major challenge. In this article we present the Web Change Detection system, which in the best case detects a change almost in real time. In the worst case it requires, on average, about 12 minutes to detect a change on a web site with a low PageRank and about one minute on a web site with a high PageRank. By contrast, current search engines need more than a day, on average, to detect a modification to a web page in either case.
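
As a rough sketch of the underlying idea only, and not the distributed and collaborative mechanism the article actually describes, a change can be detected by periodically fetching a page and comparing a hash of its content against the previously observed hash. The URL, polling interval, and choice of SHA-256 below are assumptions made purely for illustration.

    import hashlib
    import time
    import urllib.request

    def page_fingerprint(url: str) -> str:
        """Fetch a page and return the SHA-256 hash of its raw body."""
        with urllib.request.urlopen(url, timeout=10) as response:
            body = response.read()
        return hashlib.sha256(body).hexdigest()

    def watch(url: str, interval_seconds: int = 60) -> None:
        """Poll a URL and report every time its content hash changes."""
        last = page_fingerprint(url)
        while True:
            time.sleep(interval_seconds)
            current = page_fingerprint(url)
            if current != last:
                print(f"change detected at {url}")
                last = current

    if __name__ == "__main__":
        # Hypothetical target; a real detection system would avoid
        # naive per-client polling like this.
        watch("https://example.com", interval_seconds=60)

A single poller like this scales poorly; the point of distributing the work across collaborating clients is precisely to reduce the per-site detection delay and the computational cost borne by any one crawler.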
