A fast and robust method for web page template detection and removal

The widespread use of templates on the Web is considered harmful for two main reasons. Not only do they compromise the relevance judgment of many web IR and web mining methods such as clustering and classification, but they also negatively impact the performance and resource usage of tools that process web pages. In this paper we present a new method that efficiently and accurately removes templates found in collections of web pages. Our method works in two steps. First, the costly process of template detection is performed over a small set of sample pages. Then, the derived template is removed from the remaining pages in the collection. This leads to substantial performance gains when compared to previous approaches that combine template detection and removal. We show, through an experimental evaluation, that our approach is effective for identifying terms occurring in templates - obtaining F-measure values around 0.9, and that it also boosts the accuracy of web page clustering and classification methods.

[1]  Stanley M. Selkow,et al.  The Tree-to-Tree Editing Problem , 1977, Inf. Process. Lett..

[2]  Kuo-Chung Tai,et al.  The Tree-to-Tree Correction Problem , 1979, JACM.

[3]  Wuu Yang,et al.  Identifying syntactic differences between two programs , 1991, Softw. Pract. Exp..

[4]  Kaizhong Zhang,et al.  On the Editing Distance Between Unordered Labeled Trees , 1992, Inf. Process. Lett..

[5]  Thomas G. Dietterich What is machine learning? , 2020, Archives of Disease in Childhood.

[6]  Jon M. Kleinberg,et al.  Automatic Resource Compilation by Analyzing Hyperlink Structure and Associated Text , 1998, Comput. Networks.

[7]  Jakob Nielsen,et al.  User interface directions for the Web , 1999, CACM.

[8]  Sudarshan S. Chawathe,et al.  Comparing Hierarchical Data in External Memory , 1999, VLDB.

[9]  Andrei Z. Broder,et al.  A Comparison of Techniques to Find Mirrored Hosts on the WWW , 2000, IEEE Data Eng. Bull..

[10]  Weimin Chen,et al.  New Algorithm for Ordered Tree-to-Tree Correction Problem , 2001, J. Algorithms.

[11]  Soumen Chakrabarti,et al.  Enhanced topic distillation using text, markup tags, and hyperlinks , 2001, SIGIR '01.

[12]  Gabriel Valiente,et al.  An Efficient Bottom-Up Distance between Trees , 2001, SPIRE.

[13]  Ronald Fagin,et al.  Static index pruning for information retrieval systems , 2001, SIGIR '01.

[14]  Kaizhong Zhang,et al.  Finding similar consensus between trees: an algorithm and a distance hierarchy , 2001, Pattern Recognit..

[15]  H. V. Jagadish,et al.  Evaluating Structural Similarity in XML Documents , 2002, WebDB.

[16]  Ziv Bar-Yossef,et al.  Template detection via data mining and its applications , 2002, WWW.

[17]  Wei-Ying Ma,et al.  Learning block importance models for web pages , 2004, WWW '04.

[18]  Alberto H. F. Laender,et al.  Automatic web news extraction using tree edit distance , 2004, WWW '04.

[19]  Xiaoli Li,et al.  Eliminating noisy information in Web pages for data mining , 2003, KDD '03.

[20]  Mario A. Nascimento,et al.  Improving Web search efficiency via a locality based static pruning method , 2005, WWW '05.

[21]  Sandip Debnath,et al.  Automatic extraction of informative blocks from webpages , 2005, SAC '05.

[22]  Andrew Tomkins,et al.  The volume and evolution of web page templates , 2005, WWW '05.