Near-Duplicate Web Page Detection: An Efficient Approach Using Clustering, Sentence Feature and Fingerprinting
[1] Hector Garcia-Molina,et al. Finding replicated Web collections , 2000, SIGMOD '00.
[2] Ellen Spertus,et al. ParaSite: Mining Structural Information on the Web , 1997, Comput. Networks.
[3] Justin Zobel,et al. Methods for Identifying Versioned and Plagiarized Documents , 2003, J. Assoc. Inf. Sci. Technol.
[4] Koichi Takeda,et al. Information retrieval on the web , 2000, CSUR.
[5] Wen-tau Yih,et al. Adaptive near-duplicate detection via similarity learning , 2010, SIGIR.
[6] Shunkai Fu,et al. SimHash-based Effective and Efficient Detecting of Near-Duplicate Short Messages , 2009 .
[7] Mehran Sahami,et al. Evaluating similarity measures: a large-scale study in the orkut social network , 2005, KDD '05.
[8] Eiríkur Rögnvaldsson,et al. A Mixed Method Lemmatization Algorithm Using a Hierarchy of Linguistic Identities (HOLI) , 2008, GoTAL.
[9] Jack G. Conrad,et al. Online duplicate document detection: signature reliability in a dynamic retrieval environment , 2003, CIKM '03.
[10] Jeffrey Xu Yu,et al. Efficient similarity joins for near-duplicate detection , 2011, TODS.
[11] Roberto J. Bayardo,et al. Scaling up all pairs similarity search , 2007, WWW '07.
[12] Marc Najork,et al. On the evolution of clusters of near-duplicate Web pages , 2003, Proceedings of the First Latin American Web Congress (LA-WEB 2003).
[13] Julie Beth Lovins,et al. Development of a stemming algorithm , 1968, Mech. Transl. Comput. Linguistics.
[14] Monika Henzinger,et al. Finding near-duplicate web pages: a large-scale evaluation of algorithms , 2006, SIGIR.
[15] Gurmeet Singh Manku,et al. Detecting near-duplicates for web crawling , 2007, WWW '07.
[16] Hector Garcia-Molina,et al. Efficient Crawling Through URL Ordering , 1998, Comput. Networks.
[17] Moses Charikar,et al. Similarity estimation techniques from rounding algorithms , 2002, STOC '02.
[18] Sriram Raghavan,et al. Searching the Web , 2001, ACM Trans. Internet Techn.
[19] Abdur Chowdhury,et al. Lexicon randomization for near-duplicate detection with I-Match , 2007, The Journal of Supercomputing.
[20] D. Binu,et al. An approach to products placement in supermarkets using PrefixSpan algorithm , 2013, J. King Saud Univ. Comput. Inf. Sci.
[21] Neha Aggarwal,et al. Query Based Duplicate Data Detection on WWW , 2010 .
[22] Anil K. Jain,et al. Simultaneous feature selection and clustering using mixture models , 2004, IEEE Transactions on Pattern Analysis and Machine Intelligence.
[23] Michael L. Nelson,et al. Evaluation of crawling policies for a web-repository crawler , 2006, HYPERTEXT '06.
[24] Xia Hong-xia. Design and Implementation of Web Information gathering System , 2009 .
[25] Filippo Menczer,et al. Crawling the Web , 2004, Web Dynamics.
[26] Yoav Shoham,et al. Learning Information Retrieval Agents: Experiments with Automated Web Browsing , 1995 .
[27] Dunja Mladenic,et al. A Roadmap for Web Mining: From Web to Semantic Web , 2003, EWMF.
[28] Xueqi Cheng,et al. Detecting Near-Duplicates in Large-Scale Short Text Databases , 2008, PAKDD.
[29] Ohn Mar San,et al. An alternative extension of the k-means algorithm for clustering categorical data , 2004 .
[30] A. Govardhan,et al. Fixing the Threshold for Effective Detection of Near Duplicate Web Documents in Web Crawling , 2010, ADMA.
[31] Geoffrey Zweig,et al. Syntactic Clustering of the Web , 1997, Comput. Networks.
[32] Ravi Kumar,et al. Discovering Large Dense Subgraphs in Massive Graphs , 2005, VLDB.
[34] A. Govardhan,et al. To create a confusion matrix in respect of threshold being fixed for effective detection of near duplicate web documents in Web Crawling , 2011, 2011 6th International Conference on Computer Sciences and Convergence Information Technology (ICCIT).
[35] Jenq-Haur Wang,et al. Organizing News Archives by Near-Duplicate Copy Detection in Digital Libraries , 2007, ICADL.