Near-Duplicate Web Page Detection: An Efficient Approach Using Clustering, Sentence Feature and Fingerprinting

Abstract: Duplicate and near-duplicate web pages are a chief concern for web search engines. They take up enormous space in the index, which ultimately slows down the serving of results and increases its cost. A variety of techniques have been developed to identify pairs of web pages that are “similar” to each other, and the problem of finding near-duplicate web pages has been a subject of research in the database and web-search communities for some years. To identify near-duplicate web pages, we use sentence-level features together with a fingerprinting method. When a large number of web documents must be examined, we first apply K-mode clustering and then compare sentence features and fingerprints. Using these steps, we accurately identify near-duplicate web pages in an efficient manner. Experiments were carried out on web page collections, and the results confirmed the efficiency of the proposed approach in dete...
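
The sentence-level fingerprint comparison mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not specify the actual features, hash function, or threshold, so everything below is an assumption made for illustration: the helper names `sentence_fingerprints`, `fingerprint_similarity`, and `is_near_duplicate` are hypothetical, SHA-256 stands in for whatever fingerprint the authors use, Jaccard overlap stands in for their comparison, and the 0.8 threshold is arbitrary. The K-mode clustering step, which would first group pages so that only pages in the same cluster are compared, is omitted.

```python
import hashlib
import re


def sentence_fingerprints(text: str) -> set:
    """Return one short hash per normalized sentence (assumed fingerprint scheme)."""
    prints = set()
    # Naive sentence split on ., ! and ?; a real system would use a proper tokenizer.
    for sentence in re.split(r"[.!?]+", text):
        # Normalize: lowercase, keep only alphanumeric tokens, single spaces.
        normalized = " ".join(re.findall(r"[a-z0-9]+", sentence.lower()))
        if normalized:
            digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
            prints.add(digest[:16])
    return prints


def fingerprint_similarity(page_a: str, page_b: str) -> float:
    """Jaccard overlap of the two pages' sentence fingerprint sets."""
    a = sentence_fingerprints(page_a)
    b = sentence_fingerprints(page_b)
    if not a and not b:
        return 1.0  # two empty pages are treated as identical
    return len(a & b) / len(a | b)


def is_near_duplicate(page_a: str, page_b: str, threshold: float = 0.8) -> bool:
    """Flag a pair as near-duplicate when the overlap meets the assumed threshold."""
    return fingerprint_similarity(page_a, page_b) >= threshold
```

In this sketch, two pages that share most of their sentences after normalization score close to 1.0, while pages with no sentence in common score 0.0; restricting comparisons to pages within the same cluster is what would keep the number of pairwise checks manageable on a large collection.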
