Paper Information - Efficient Approach for Near Duplicate Document Detection Using Textual and Conceptual Based Techniques

With the rapid growth and use of the World Wide Web, a huge number of duplicate web pages now exist. For a search engine to return results free of duplicates, those duplicates must be detected and eliminated. The proposed approach combines the strengths of state-of-the-art duplicate detection algorithms such as Shingling and Simhash to efficiently detect and eliminate near-duplicate web pages while accounting for important factors such as word order. In addition, it employs Latent Semantic Indexing (LSI) to detect conceptually similar documents, which textual duplicate detection techniques such as Shingling and Simhash often miss. The approach uses Hamming distance for textual duplicate detection and cosine similarity for conceptual duplicate detection as the similarity measures between two documents. To measure performance, the F-measure of the proposed approach is compared with that of the traditional Simhash technique. Experimental results show that our approach can outperform traditional Simhash.
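
The textual side of the abstract (word shingles fed into a Simhash fingerprint, compared by Hamming distance) can be illustrated with a minimal sketch. This is not the paper's implementation: the shingle size `k=3`, the 64-bit fingerprint width, the use of MD5 as the feature hash, the unweighted features, and all function names are assumptions chosen for illustration. The LSI step is not covered here.

```python
import hashlib


def shingles(text, k=3):
    """Return overlapping k-word shingles, which preserve local word order."""
    words = text.lower().split()
    if len(words) < k:
        return [" ".join(words)] if words else []
    return [" ".join(words[i:i + k]) for i in range(len(words) - k + 1)]


def simhash(features, bits=64):
    """Compute an unweighted Simhash fingerprint over string features."""
    counts = [0] * bits
    for feature in features:
        # Assumption: MD5 truncated to `bits` bits serves as the feature hash.
        digest = hashlib.md5(feature.encode("utf-8")).digest()
        h = int.from_bytes(digest[:bits // 8], "big")
        for i in range(bits):
            counts[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, c in enumerate(counts):
        if c > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


doc_a = "the quick brown fox jumps over the lazy dog near the river bank"
doc_b = "the quick brown fox jumps over the lazy dog near the river shore"
fp_a = simhash(shingles(doc_a))
fp_b = simhash(shingles(doc_b))
print(hamming_distance(fp_a, fp_b))
```

Two documents would be flagged as near duplicates when the distance falls below a chosen threshold; the threshold value is a tuning decision that this sketch does not fix.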
