A Direct Web Page Templates Detection Method

Many web sites are now generated from web templates to improve the productivity of web site construction. However, the prevalence of web templates has a negative impact on the efficiency of search engines in many respects, including relevance judgment in web IR and the resource usage of analysis tools. In this paper, we present a direct and fast method for detecting pages that share the same template, based on DOM tree characteristics. After analyzing and compressing the DOM tree nodes of an HTML page, our method generates a hash digest, also called a fingerprint, for each page to identify its DOM structure. We also introduce additional page features to aid in judging the page template type. Through experimental evaluation on over thirty thousand sub-domains, we show that our approach obtains analysis results rapidly while achieving an accuracy rate above 95 percent.
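
To make the fingerprinting idea concrete, the following is a minimal sketch in Python, not the exact procedure used in this work. The abstract does not fix the compression rule or the hash function, so both are assumptions here: "compression" is taken to mean reducing the DOM tree to its set of distinct root-to-node tag paths (so repeated items such as list entries collapse to one path), and SHA-256 is used as the digest. The names `TagPathCollector` and `dom_fingerprint` are hypothetical.

```python
import hashlib
from html.parser import HTMLParser

# Void elements have no closing tag, so they are never pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TagPathCollector(HTMLParser):
    """Collects the distinct root-to-node tag paths of a page.

    Text content and attributes are ignored, so only structure matters.
    """

    def __init__(self):
        super().__init__()
        self.stack = []     # currently open elements
        self.paths = set()  # assumed "compressed" form of the DOM tree

    def handle_starttag(self, tag, attrs):
        self.paths.add("/".join(self.stack + [tag]))
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop back to the matching open tag, tolerating unclosed children.
            while self.stack.pop() != tag:
                pass


def dom_fingerprint(html):
    """Return a hex digest that identifies the page's DOM structure."""
    collector = TagPathCollector()
    collector.feed(html)
    collector.close()
    # Sorting makes the digest independent of the order in which paths appear.
    serialized = "\n".join(sorted(collector.paths))
    return hashlib.sha256(serialized.encode("utf-8")).hexdigest()


# Two pages with the same layout but different content and item counts,
# plus one page with a different layout.
page_a = ("<html><body><div><ul><li><a href='/1'>One</a></li>"
          "<li><a href='/2'>Two</a></li></ul></div></body></html>")
page_b = ("<html><body><div><ul><li><a href='/x'>Other</a></li>"
          "</ul></div></body></html>")
page_c = "<html><body><table><tr><td>cell</td></tr></table></body></html>"

same_template = dom_fingerprint(page_a) == dom_fingerprint(page_b)
different_template = dom_fingerprint(page_a) != dom_fingerprint(page_c)
```

Under these assumptions, `page_a` and `page_b` receive the same fingerprint because they differ only in text, attributes, and the number of list items, while `page_c` receives a different one. Comparing fixed-length digests rather than whole trees is what makes grouping pages by template cheap; the additional page features mentioned above are not modeled in this sketch.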