A fast HTML web page change detection approach based on hashing and reducing the number of similarity computations

This paper describes a fast change detection approach for HTML web pages. It saves computation time in two ways: by limiting the similarity computations between two versions of a page to nodes that have the same HTML tag type, and by hashing the page to provide direct access to node information. The approach is efficient enough to run as a client application and to underpin server applications that let users monitor modifications to HTML web pages over time, and that report and visualize changes and trends so users can gain insight into the significance and types of those changes. To detect changes across two versions of a page, each version is first transformed into an XML-like structure in which a node corresponds to an open-close HTML tag pair, and similarity computations are then performed on the nodes. Experiments measuring performance and detection reliability showed speed improvements over a previous approach.
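The two time-saving ideas can be illustrated with a minimal Python sketch. It is not the paper's implementation: the names `TagBucketParser`, `bucket_nodes`, and `detect_changes`, the 0.6 threshold, and the use of `difflib.SequenceMatcher` as the similarity measure are all assumptions made for illustration, since the abstract does not specify the actual hashing scheme or similarity function. A dictionary keyed by tag name stands in for the hash table, and similarity is computed only between nodes that share a tag.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from html.parser import HTMLParser


class TagBucketParser(HTMLParser):
    """Collect the direct text of each open-close tag pair, bucketed by tag name."""

    def __init__(self):
        super().__init__()
        self.buckets = defaultdict(list)  # hash table: tag name -> node texts
        self._stack = []                  # currently open nodes: [tag, [text pieces]]

    def handle_starttag(self, tag, attrs):
        self._stack.append([tag, []])

    def handle_data(self, data):
        # Attach non-blank text to the innermost open node.
        if self._stack and data.strip():
            self._stack[-1][1].append(data.strip())

    def handle_endtag(self, tag):
        # Close the innermost open node with this tag name, if there is one.
        for i in range(len(self._stack) - 1, -1, -1):
            if self._stack[i][0] == tag:
                self.buckets[tag].append(" ".join(self._stack[i][1]))
                del self._stack[i:]  # also discards unclosed nodes above it
                return


def bucket_nodes(html):
    parser = TagBucketParser()
    parser.feed(html)
    parser.close()
    return parser.buckets


def similarity(a, b):
    # Assumed similarity measure; the paper's own measure may differ.
    return SequenceMatcher(None, a, b).ratio()


def detect_changes(old_html, new_html, threshold=0.6):
    """Return (kind, tag, old_text, new_text) tuples for two page versions."""
    old, new = bucket_nodes(old_html), bucket_nodes(new_html)
    changes = []
    for tag in sorted(set(old) | set(new)):
        old_nodes = list(old.get(tag, []))
        new_nodes = list(new.get(tag, []))
        # Pass 1: identical nodes need no similarity computation.
        for text in list(old_nodes):
            if text in new_nodes:
                old_nodes.remove(text)
                new_nodes.remove(text)
        # Pass 2: compare the remaining nodes, but only within this tag bucket.
        for text in old_nodes:
            best = max(new_nodes, key=lambda c: similarity(text, c), default=None)
            if best is not None and similarity(text, best) >= threshold:
                new_nodes.remove(best)
                changes.append(("modified", tag, text, best))
            else:
                changes.append(("deleted", tag, text, None))
        changes.extend(("inserted", tag, None, text) for text in new_nodes)
    return changes


old_page = "<html><body><h1>News</h1><p>The weather is sunny today</p><p>Old item</p></body></html>"
new_page = "<html><body><h1>News</h1><p>The weather is rainy today</p><div>Fresh</div></body></html>"
for change in detect_changes(old_page, new_page):
    print(change)
```

Because a `p` node is never compared against a `div` or `h1` node, the number of pairwise similarity computations is bounded by the bucket sizes rather than by the total node count. The greedy matching in the second pass is a simplification chosen for brevity.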
