A large number of web sites are now generated from web templates in order to improve the productivity of web site construction. However, the prevalence of web templates harms the efficiency of search engines in several respects, including relevance judgment in web IR and the resource usage of analysis tools. In this paper, we present a direct and fast method for detecting pages that share the same template, based on characteristics of the DOM tree. After analyzing and compressing the DOM tree nodes of an HTML page, our method generates a hash digest, also called a fingerprint, for each page to identify its DOM structure. We also introduce several other page features to aid in judging the template type of a page. Through experimental evaluation over thirty thousand sub-domains, we show that our approach obtains analysis results rapidly while maintaining an accuracy rate above 95 percent.
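The fingerprinting idea can be illustrated with a minimal sketch. It is not the method evaluated in the paper: the abstract does not specify the compression rule or the hash function, so the sketch assumes that "compression" means keeping each distinct root-to-node tag path only once, and it uses MD5 as the digest. The names `TagPathCollector` and `dom_fingerprint` are hypothetical.

```python
import hashlib
from html.parser import HTMLParser

# Void elements never get a closing tag, so they are not pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TagPathCollector(HTMLParser):
    """Record the root-to-node tag path of every element; text and attributes are ignored."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags
        self.paths = []   # one path string per element, in document order

    def handle_starttag(self, tag, attrs):
        self.paths.append("/".join(self.stack + [tag]))
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; stray end tags are ignored.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def dom_fingerprint(html: str) -> str:
    """Return a hex digest that depends only on the set of distinct tag paths.

    Assumption: duplicate paths are dropped (order of first appearance kept),
    so pages that differ only in text or in the number of repeated items,
    such as list entries, receive the same fingerprint.
    """
    collector = TagPathCollector()
    collector.feed(html)
    collector.close()
    unique_paths = list(dict.fromkeys(collector.paths))
    return hashlib.md5("\n".join(unique_paths).encode("utf-8")).hexdigest()
```

Under these assumptions, two pages rendered from the same template with different content map to the same digest, so candidate template groups can be found by grouping pages on the fingerprint value. The additional page features mentioned in the abstract are not modeled here.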