Following the dynamic block on the Web

As dynamic web pages change ever more rapidly, users increasingly need instant updates for dynamic blocks on the Web. In this paper, we address the problem of automatically following dynamic blocks in web pages. Given a user-specified block on a web page, we continuously track the content of the block and report updates in real time. Such a service lets users track, for example, the top-ten breaking news on CNN, iPhone prices on Amazon, or NBA game scores. We study 3,346 human-labeled blocks from 1,127 pages and analyze the effectiveness of four types of patterns for tracking content blocks: visual area, DOM tree path, inner content, and close context. Because web pages change frequently, we find that the initial patterns generated on the original page can become invalid over time, causing extraction of the correct block to fail. Based on these observations, we combine different patterns to improve the accuracy and stability of block extraction. Moreover, we propose an adaptive model that adapts each pattern individually and adjusts the pattern weights for an improved combination. The experimental results show that the proposed models outperform existing approaches, with the adaptive model performing best.
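The abstract does not specify how the patterns are combined or how the weights are adjusted, so the following is only a minimal sketch of the general idea, not the model proposed in the paper. It assumes that each of the four pattern types yields a match score in [0, 1] for every candidate block, that the combination is a weighted sum, and that weights drift toward the patterns that agreed with the block just extracted. The class name `BlockTracker`, the pattern keys, and the `learning_rate` parameter are all hypothetical.

```python
from dataclasses import dataclass, field

# Hypothetical keys for the four pattern types named in the abstract.
PATTERNS = ("visual_area", "dom_path", "inner_content", "close_context")


@dataclass
class BlockTracker:
    # Assumption: start with equal trust in every pattern.
    weights: dict = field(default_factory=lambda: {p: 1.0 / len(PATTERNS) for p in PATTERNS})
    # Assumption: a simple exponential-smoothing rate, not taken from the paper.
    learning_rate: float = 0.5

    def combined_score(self, scores: dict) -> float:
        """Weighted sum of per-pattern match scores for one candidate block."""
        return sum(self.weights[p] * scores.get(p, 0.0) for p in PATTERNS)

    def pick_block(self, candidates: dict) -> str:
        """Return the id of the candidate block with the highest combined score."""
        return max(candidates, key=lambda b: self.combined_score(candidates[b]))

    def adapt(self, chosen_scores: dict) -> None:
        """Shift weight toward patterns that matched the chosen block well."""
        lr = self.learning_rate
        for p in PATTERNS:
            self.weights[p] = (1 - lr) * self.weights[p] + lr * chosen_scores.get(p, 0.0)
        total = sum(self.weights.values())
        if total > 0:
            # Renormalize so the weights sum to 1.
            for p in PATTERNS:
                self.weights[p] /= total
        else:
            # No pattern matched at all: fall back to uniform weights.
            for p in PATTERNS:
                self.weights[p] = 1.0 / len(PATTERNS)


# Usage: the page was redesigned, so the old DOM path now points at an ad block,
# while the other three patterns still agree on the news block.
tracker = BlockTracker()
candidates = {
    "news": {"visual_area": 0.9, "dom_path": 0.1, "inner_content": 0.8, "close_context": 0.9},
    "ads": {"visual_area": 0.2, "dom_path": 1.0, "inner_content": 0.1, "close_context": 0.1},
}
chosen = tracker.pick_block(candidates)
tracker.adapt(candidates[chosen])
```

In this toy scenario the combination keeps selecting the news block even though one pattern has been invalidated, and the adaptation step lowers the weight of the DOM path pattern relative to the others.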
