Archiving the web using page changes patterns: a case study

A pattern is a model or template that summarizes and describes the behavior (or trend) of data that generally contain recurrent events. Patterns have received considerable attention in recent years and have been widely studied in the data mining field. Various pattern mining approaches have been proposed for applications such as network monitoring, moving object tracking, financial or medical data analysis, and scientific data processing. In these contexts, discovered patterns have been used to detect anomalies, to predict data behavior (or trend) or, more generally, to simplify data processing or improve system performance. However, to the best of our knowledge, patterns have never been used in the context of Web archiving. Web archiving is the process of continuously collecting and preserving portions of the World Wide Web for future generations. In this paper, we show how patterns of page changes can be useful tools for archiving Websites efficiently. We first define our pattern model, which describes the importance of page changes. Then, we present the strategy used to (i) extract the temporal evolution of page changes, (ii) discover patterns, and (iii) exploit them to improve Web archives. The archive of the French public TV channels France Télévisions is chosen as a case study to validate our approach. Our experimental evaluation, based on real Web pages, shows the utility of patterns for improving archive quality and optimizing indexing or storage.
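The three-step strategy can be illustrated with a minimal sketch. It is not the paper's pattern model: it assumes, purely for illustration, that each observed change already carries an importance score in [0, 1], that a pattern is the average importance per hour of the day, and that the pattern is exploited by crawling at the hours with the highest average. The names `build_pattern` and `schedule_crawls`, the 24-slot granularity, and the sample data are all hypothetical.

```python
from collections import defaultdict
from statistics import mean


def build_pattern(observations, slots_per_day=24):
    """Average change importance per time slot (assumed pattern form).

    observations: iterable of (day, slot, importance) tuples, where
    importance is an assumed score in [0, 1] for the change seen in
    that slot on that day.
    """
    by_slot = defaultdict(list)
    for _day, slot, importance in observations:
        by_slot[slot].append(importance)
    # Slots with no observation default to 0.0 (no change detected).
    return [mean(by_slot[s]) if by_slot[s] else 0.0
            for s in range(slots_per_day)]


def schedule_crawls(pattern, budget):
    """Pick the `budget` slots with the highest average importance.

    Ties are broken in favor of the earlier slot; the result is
    returned in chronological order.
    """
    ranked = sorted(range(len(pattern)), key=lambda s: (-pattern[s], s))
    return sorted(ranked[:budget])


# Hypothetical change log for one page over two days:
# (day, hour of day, importance of the change detected at that hour).
observations = [
    (0, 8, 0.9), (0, 13, 0.4), (0, 20, 0.7),
    (1, 8, 0.7), (1, 13, 0.2), (1, 20, 0.9),
]

pattern = build_pattern(observations)
print(schedule_crawls(pattern, budget=2))
```

With a budget of two crawls per day, the sketch selects the two hours at which important changes recur most strongly, which is one simple way a discovered pattern could guide when to capture a page.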
