Distributed and collaborative Web Change Detection system

Search engines use crawlers to traverse the Web, downloading web pages to build their indexes. Keeping these indexes up to date is essential to the quality of search results, but changes to web pages are unpredictable, and identifying when a page changes as soon as possible and at minimal computational cost is a major challenge. In this article we present the Web Change Detection system, which in the best case detects a change almost in real time. In the worst case it requires, on average, about 12 minutes to detect a change on a web site with a low PageRank and about one minute on a web site with a high PageRank. By contrast, current search engines need more than a day, on average, to detect a modification to a web page in either case.
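
As a rough sketch of the underlying idea only, and not the distributed and collaborative mechanism the article actually describes, a change can be detected by periodically fetching a page and comparing a hash of its content against the previously observed hash. The URL, polling interval, and choice of SHA-256 below are assumptions made purely for illustration.

    import hashlib
    import time
    import urllib.request

    def page_fingerprint(url: str) -> str:
        """Fetch a page and return the SHA-256 hash of its raw body."""
        with urllib.request.urlopen(url, timeout=10) as response:
            body = response.read()
        return hashlib.sha256(body).hexdigest()

    def watch(url: str, interval_seconds: int = 60) -> None:
        """Poll a URL and report every time its content hash changes."""
        last = page_fingerprint(url)
        while True:
            time.sleep(interval_seconds)
            current = page_fingerprint(url)
            if current != last:
                print(f"change detected at {url}")
                last = current

    if __name__ == "__main__":
        # Hypothetical target; a real detection system would avoid
        # naive per-client polling like this.
        watch("https://example.com", interval_seconds=60)

A single poller like this scales poorly; the point of distributing the work across collaborating clients is precisely to reduce the per-site detection delay and the computational cost borne by any one crawler.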
