High-dimensional clustering: a clique-based hypergraph partitioning framework

Hypergraph partitioning is considered a promising method for addressing the challenges of high-dimensional clustering. With objects modeled as vertices and the relationships among objects captured by hyperedges, the goal of hypergraph partitioning is to minimize the hyperedge cut. The definition of hyperedges is therefore vital to clustering performance. While several definitions of hyperedges have been proposed, a systematic understanding of the desirable characteristics of hyperedges is still missing. To that end, in this paper, we first provide a unified clique perspective on the definition of hyperedges, which serves as a guide for defining them. With this perspective, and based on the concepts of shared (reverse) nearest neighbors, we propose two new types of clique hyperedges and analyze their properties with regard to purity and size. Finally, we present an extensive evaluation using real-world document datasets. The experimental results show that, with shared (reverse) nearest neighbor-based hyperedges, clustering performance improves significantly in terms of various external validation measures, without the need for fine-tuning of parameters.
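
To make the partitioning objective concrete, the following is a minimal sketch of how a hyperedge cut can be computed for a given assignment of objects to clusters. It assumes one common definition of the cut (the total weight of hyperedges whose vertices fall in more than one part); the paper's exact objective may differ. The function name `hyperedge_cut`, the document identifiers, and the weights are all hypothetical and chosen only for illustration.

```python
from typing import Dict, FrozenSet, Hashable, List, Tuple

# A hyperedge is a set of vertices plus a weight (assumed representation).
Hyperedge = Tuple[FrozenSet[Hashable], float]


def hyperedge_cut(hyperedges: List[Hyperedge],
                  part_of: Dict[Hashable, int]) -> float:
    """Sum the weights of hyperedges that span more than one part.

    Assumption: a hyperedge counts as cut as soon as its vertices are
    assigned to at least two different parts.
    """
    cut = 0.0
    for vertices, weight in hyperedges:
        # Collect the distinct parts touched by this hyperedge.
        parts = {part_of[v] for v in vertices}
        if len(parts) > 1:
            cut += weight
    return cut


# Toy data: six documents and three weighted hyperedges (illustrative only).
edges: List[Hyperedge] = [
    (frozenset({"d1", "d2", "d3"}), 1.0),
    (frozenset({"d3", "d4"}), 0.5),
    (frozenset({"d4", "d5", "d6"}), 2.0),
]

# Partition A keeps both large hyperedges intact and cuts only the light one.
partition_a = {"d1": 0, "d2": 0, "d3": 0, "d4": 1, "d5": 1, "d6": 1}
# Partition B moves d3 to the other side and cuts the first hyperedge instead.
partition_b = {"d1": 0, "d2": 0, "d3": 1, "d4": 1, "d5": 1, "d6": 1}

cut_a = hyperedge_cut(edges, partition_a)
cut_b = hyperedge_cut(edges, partition_b)
```

Under this definition, partition A has the smaller cut, so a partitioner minimizing the cut would prefer it. The sketch also shows why the choice of hyperedges matters: which objects are grouped into a hyperedge, and with what weight, directly determines which partitions look cheap.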
