Frequent Itemset Mining for Clustering Near Duplicate Web Documents

A vast amount of documents in the Web have duplicates, which is a challenge for developing efficient methods that would compute clusters of similar documents. In this paper we use an approach based on computing (closed) sets of attributes having large support (large extent) as clusters of similar documents. The method is tested in a series of computer experiments on large public collections of web documents and compared to other established methods and software, such as biclustering, on same datasets. Practical efficiency of different algorithms for computing frequent closed sets of attributes is compared.

[1]  Hector Garcia-Molina,et al.  Finding replicated Web collections , 2000, SIGMOD '00.

[2]  Sergei O. Kuznetsov,et al.  Comparing performance of algorithms for generating concept lattices , 2002, J. Exp. Theor. Artif. Intell..

[3]  Bart Goethals,et al.  Advances in frequent itemset mining implementations: report on FIMI'03 , 2004, SKDD.

[4]  Christian Borgelt,et al.  EFFICIENT IMPLEMENTATIONS OF APRIORI AND ECLAT , 2003 .

[5]  Joshua Alspector,et al.  Improved robustness of signature-based near-replica detection via lexicon randomization , 2004, KDD.

[6]  Dan Klein,et al.  Evaluating strategies for similarity search on the web , 2002, WWW '02.

[7]  Jeffrey Xu Yu,et al.  Efficient similarity joins for near-duplicate detection , 2011, TODS.

[8]  Ahmad M. Hasnah,et al.  A New Filtering Algorithm for Duplicate Document Based on Concept Analysis , 2006 .

[9]  Derrick G. Kourie,et al.  AddIntent: A New Incremental Algorithm for Constructing Concept Lattices , 2004, ICFCA.

[10]  Bart Goethals,et al.  Advances in Frequent Itemset Mining Implementations: Introduction to FIMI03 , 2003, FIMI.

[11]  Bernhard Ganter,et al.  Formal Concept Analysis: Mathematical Foundations , 1998 .

[12]  Benno Stein,et al.  New Issues in Near-duplicate Detection , 2007, GfKl.

[13]  Bart Goethals,et al.  Proceedings of the ICDM 2003 Workshop on Frequent Itemset Mining Implementations (FIMI'03) , 2003 .

[14]  George Karypis,et al.  Empirical and Theoretical Comparisons of Selected Criterion Functions for Document Clustering , 2004, Machine Learning.

[15]  Hector Garcia-Molina,et al.  Finding near-replicas of documents on the Web , 1999 .

[16]  Andrei Z. Broder,et al.  Identifying and Filtering Near-Duplicate Documents , 2000, CPM.

[17]  Ophir Frieder,et al.  Collection statistics for fast duplicate document detection , 2002, TOIS.

[18]  Hongjun Lu,et al.  AFOPT: An Efficient Implementation of Pattern Growth Approach , 2003, FIMI.

[19]  Jean-François Boulicaut,et al.  Constraint-Based Mining of Formal Concepts in Transactional Data , 2004, PAKDD.

[20]  Eckart Zitzler,et al.  BicAT: a biclustering analysis toolbox , 2006, Bioinform..

[21]  Gösta Grahne,et al.  Efficiently Using Prefix-trees in Mining Frequent Itemsets , 2003, FIMI.

[22]  Alan M. Frieze,et al.  Min-wise independent permutations (extended abstract) , 1998, STOC '98.

[23]  Hector Garcia-Molina,et al.  Finding Near-Replicas of Documents and Servers on the Web , 1998, WebDB.

[24]  Andrei Z. Broder,et al.  On the resemblance and containment of documents , 1997, Proceedings. Compression and Complexity of SEQUENCES 1997 (Cat. No.97TB100171).

[25]  Ilya Valentinovich Segalovich,et al.  An efficient method to detect duplicates of web documents with the use of inverted index , 2002, WWW 2002.

[26]  Nicolas Pasquier,et al.  Efficient Mining of Association Rules Using Closed Itemset Lattices , 1999, Inf. Syst..

[27]  Johannes Gehrke,et al.  MAFIA: A Performance Study of Mining Maximal Frequent Itemsets , 2003, FIMI.

[28]  Alan M. Frieze,et al.  Min-Wise Independent Permutations , 2000, J. Comput. Syst. Sci..