Coherence-Oriented Crawling and Navigation Using Patterns for Web Archives

This paper addresses the problem of improving the coherence of web archives under limited resources (e.g., bandwidth and storage space). Coherence measures how well a collection of archived page versions reflects the real state (or snapshot) of a set of related web pages at different points in time. An ideal way to preserve the coherence of archives would be to prevent page content from changing during the crawl of a complete collection. However, this is infeasible in practice because web sites are autonomous and dynamic. We propose two solutions: one a priori and one a posteriori. As an a priori solution, we crawl sites during off-peak hours (i.e., the periods during which very few changes are expected on the pages), based on patterns. A pattern models how the importance of page changes behaves over a period of time. As an a posteriori solution, based on the same patterns, we introduce a novel navigation approach that enables users to browse the most coherent page versions at a given query time.
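
The a priori idea of crawling during off-peak hours can be illustrated with a minimal sketch. It assumes, purely for illustration, that a pattern is a list of 24 hourly change-importance scores for a site and that a crawl occupies a fixed number of consecutive hours; the abstract does not specify the pattern representation, so the function name `off_peak_window`, the `daily_pattern` values, and the hourly granularity are all hypothetical.

```python
from typing import Sequence


def off_peak_window(pattern: Sequence[float], crawl_hours: int) -> int:
    """Return the start hour of the quietest window in a periodic pattern.

    `pattern[h]` is the expected importance of changes during hour `h`
    (an assumed representation, not the paper's definition). The window
    may wrap around the end of the period, e.g. from 23:00 to 02:00.
    """
    n = len(pattern)
    if not 0 < crawl_hours <= n:
        raise ValueError("crawl_hours must be between 1 and len(pattern)")

    best_start, best_cost = 0, float("inf")
    for start in range(n):
        # Total expected change importance if the crawl begins at `start`.
        cost = sum(pattern[(start + k) % n] for k in range(crawl_hours))
        if cost < best_cost:  # strict comparison keeps the earliest tie
            best_start, best_cost = start, cost
    return best_start


# Hypothetical daily pattern: quiet at night, busy during the day.
daily_pattern = [2, 1, 0, 0, 2, 3, 6, 8, 9, 9, 8, 7,
                 7, 8, 9, 8, 7, 6, 5, 4, 4, 3, 3, 2]

start_hour = off_peak_window(daily_pattern, crawl_hours=3)
print(f"Schedule the crawl to start at {start_hour:02d}:00")
```

Under these assumptions, the crawl is scheduled for the three consecutive hours with the lowest summed score, which is when pages are least likely to change while the collection is being captured.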
