Hypergraph Based Document Categorization: Frequent Itemsets vs Hypercliques

This paper describes a new hypergraph formulation for document categorization in which hyperclique patterns (here, groups of strongly affiliated documents) are used as hyperedges. Unlike the objects in a frequent itemset, the objects in a hyperclique pattern have a guaranteed level of global pairwise similarity to one another, as measured by the cosine or Jaccard similarity measure. Since hypergraph partitioning is mainly based on the similarity of the vertices on a hyperedge, hypercliques may serve as better-quality hyperedges. Moreover, due to the additional confidence constraint, the mined patterns can cover more items while keeping the pattern size reasonable. Hence, the difficulty of partitioning dense hypergraphs, which is often encountered in frequent-itemset-based hypergraph partitioning, is alleviated considerably. Finally, experiments with real-world datasets show that using hyperclique patterns as hyperedges improves the clustering results in terms of various external validation measures.
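The pairwise-similarity guarantee mentioned above can be illustrated with a small sketch. The abstract does not define the confidence constraint, so this assumes the usual h-confidence measure from the hyperclique literature: the support of a pattern divided by the largest support of any single item in it. The toy data, the function names, and the choice of documents as items and terms as transactions are illustrative assumptions, not the paper's implementation.

```python
from itertools import combinations
from math import sqrt

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    items = set(itemset)
    return sum(1 for t in transactions if items <= t) / len(transactions)

def h_confidence(itemset, transactions):
    """Assumed definition: supp(P) / max over items i in P of supp({i})."""
    max_single = max(support({i}, transactions) for i in itemset)
    return support(itemset, transactions) / max_single if max_single else 0.0

def min_pairwise_cosine(itemset, transactions):
    """Smallest cosine similarity between the binary vectors of any two items."""
    values = []
    for a, b in combinations(sorted(itemset), 2):
        denom = sqrt(support({a}, transactions) * support({b}, transactions))
        values.append(support({a, b}, transactions) / denom if denom else 0.0)
    return min(values)

# Hypothetical toy data: each transaction is a term, and its members are
# the documents (d1..d4) that contain that term.
terms = [
    {"d1", "d2", "d3"},
    {"d1", "d2", "d3"},
    {"d1", "d2"},
    {"d4"},
    {"d3", "d4"},
]

pattern = {"d1", "d2", "d3"}
hconf = h_confidence(pattern, terms)
lowest_cosine = min_pairwise_cosine(pattern, terms)
print(f"h-confidence = {hconf:.3f}, min pairwise cosine = {lowest_cosine:.3f}")
```

For binary data, the cosine of any two items in a pattern is at least the pattern's h-confidence, so a minimum h-confidence threshold bounds the pairwise similarity of the vertices on each hyperedge from below. A plain support threshold, as used for frequent itemsets, gives no such bound.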
