Deriving Dynamics of Web Pages: A Survey

The World Wide Web is dynamic by nature: content is continuously added, deleted, or changed. This makes it challenging for Web crawlers to keep up to date with the current version of a Web page, all the more so because not all apparent changes are significant. We review major approaches to change detection in Web pages and to the extraction of temporal properties of Web pages, especially timestamps. We focus on techniques and systems proposed in the last ten years and analyze them to gain insight into the practical solutions and best practices available. We aim to provide an analytical view of the range of methods that can be used, distinguishing them along several dimensions, in particular their static or dynamic nature, their modeling of Web pages, and, for dynamic methods that compare successive versions of a page, the similarity metrics used. We advocate more comprehensive studies of the effectiveness of Web page change detection methods, and finally highlight open issues.
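To make the idea of a dynamic method concrete, the following is a minimal sketch of comparing two successive versions of a page with a similarity metric. It assumes a flat text model of the page (tags stripped by a crude regular expression), word-level shingles, and Jaccard similarity. The function names, the shingle size `k`, and the `threshold` value are illustrative assumptions, not choices taken from the survey.

```python
import re


def shingles(text, k=3):
    """Return the set of k-word shingles of a page, ignoring markup and case."""
    # Crude tag stripping: an assumption of this sketch, not a real HTML parser.
    words = re.sub(r"<[^>]+>", " ", text).lower().split()
    if len(words) < k:
        # Short texts yield a single shingle (or none if the text is empty).
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def resemblance(old, new, k=3):
    """Jaccard similarity between the shingle sets of two page versions."""
    a, b = shingles(old, k), shingles(new, k)
    if not a and not b:
        return 1.0  # two empty pages are treated as identical
    return len(a & b) / len(a | b)


def has_changed(old, new, threshold=0.9, k=3):
    """Flag a change only when similarity drops below the (assumed) threshold."""
    return resemblance(old, new, k) < threshold
```

Under this model, a version that differs only in markup scores 1.0 and is not reported as changed, which illustrates why an apparent change need not be a significant one. The threshold would have to be tuned for each collection.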
