Duplicate Web Page Elimination Based on HTML and Long Sentence Extraction

We have developed an efficient algorithm for eliminating duplicate web pages. The algorithm exploits HTML tags to filter out page noise and extracts the long sentences that can represent a page as its features. The number of long sentences shared by two pages serves as the duplication metric. The algorithm indexes the long sentences in a red-black tree, which turns elimination into a search process and thereby reduces the running time. Experimental results show that the algorithm eliminates duplicate web pages efficiently and accurately.
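The sketch below illustrates the idea under stated assumptions, not the paper's implementation: long sentences extracted from each page are inserted into a std::set (typically a red-black tree in STL implementations), so checking a new page reduces to lookups rather than pairwise page comparison. The length and sharing thresholds (MIN_SENTENCE_LEN, SHARED_THRESHOLD), as well as the function names, are illustrative choices, not values from the paper.

```cpp
#include <iostream>
#include <set>
#include <string>
#include <vector>

// Illustrative parameters (assumptions, not the paper's values).
constexpr std::size_t MIN_SENTENCE_LEN = 40;  // characters; marks a sentence as "long"
constexpr std::size_t SHARED_THRESHOLD = 2;   // shared long sentences implying duplication

// Keep only sentences long enough to serve as page features.
std::vector<std::string> extractLongSentences(const std::vector<std::string>& sentences) {
    std::vector<std::string> longOnes;
    for (const auto& s : sentences)
        if (s.size() >= MIN_SENTENCE_LEN)
            longOnes.push_back(s);
    return longOnes;
}

// Returns true if the page shares enough long sentences with pages already
// indexed; otherwise inserts its long sentences and keeps the page.
bool isDuplicate(const std::vector<std::string>& pageSentences,
                 std::set<std::string>& index) {
    const auto features = extractLongSentences(pageSentences);
    std::size_t shared = 0;
    for (const auto& s : features)
        if (index.count(s)) ++shared;
    if (shared >= SHARED_THRESHOLD)
        return true;
    index.insert(features.begin(), features.end());
    return false;
}

int main() {
    // std::set is a balanced search tree (commonly a red-black tree),
    // giving logarithmic-time insertion and lookup of sentence features.
    std::set<std::string> index;
    std::vector<std::string> page = {
        "Short line.",
        "This is a sufficiently long sentence that can identify the content of the page.",
        "Another long sentence that is characteristic enough to be used as a page feature."
    };
    std::cout << std::boolalpha
              << isDuplicate(page, index) << '\n'   // false: first occurrence, features indexed
              << isDuplicate(page, index) << '\n';  // true: both long sentences already indexed
}
```

In this sketch, sentence splitting and HTML tag filtering are assumed to happen upstream; only the indexing and shared-sentence counting steps are shown.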