Near-duplicate web page detection: A comparative study of two contrary approaches

Detection of duplicate and near-duplicate web pages has attracted extensive research in the web crawling community. Many techniques for near-duplicate detection have been proposed in the literature, but none has been accepted as a universal solution. The fingerprint-based approach proposed by G.S. Manku et al. in 2007 is considered one of the “state-of-the-art” algorithms for finding near-duplicate web pages. In our earlier work, we devised an efficient similarity-score-based approach for near-duplicate web page detection. Experiments on that approach showed that it achieves detection accuracy almost identical to that of G.S. Manku et al.'s fingerprint-based approach. Hence, in this paper, we conduct an extensive comparative study of our similarity-score-based approach and G.S. Manku et al.'s fingerprint-based approach in terms of two computational factors: 1) time and 2) storage space. Because the detection accuracy of the two approaches is essentially the same, we use these complexity measures, time and memory space, to determine which of the two is better. The comparative study identifies the better (less complex) of the two approaches with respect to the factors considered.