Natural Document Clustering by Clique Percolation in Random Graphs

Document clustering techniques mostly depend on models that impose explicit or implicit a priori assumptions about the number, size, and disjointness of clusters, or about the probability distribution of the clustered data. As a result, the clusters they produce tend to be unnatural and deviate, to a greater or lesser degree, from the intrinsic grouping of the documents in a corpus. We propose a novel graph-theoretic technique called Clique Percolation Clustering (CPC). It models clustering as a process of enumerating adjacent maximal cliques in a random graph that unveils the inherent structure of the underlying data, and it relaxes the commonly imposed constraints in order to discover natural overlapping clusters. Experiments show that CPC can outperform some typical algorithms on benchmark data sets, and they shed light on natural document clustering.
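
To make the idea of enumerating adjacent maximal cliques concrete, the following is a minimal sketch and not the authors' implementation. It assumes that documents are vertices, that an edge joins two documents judged similar, and that two maximal cliques count as adjacent when they share at least k-1 vertices. The adjacency rule, the parameter `k`, and the helper names `build_graph`, `maximal_cliques`, and `clique_percolation` are all assumptions introduced for illustration; the abstract does not specify them.

```python
from itertools import combinations


def build_graph(n, edges):
    """Return an adjacency dict {vertex: set of neighbours} (hypothetical helper)."""
    adj = {v: set() for v in range(n)}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    return adj


def maximal_cliques(adj):
    """Enumerate all maximal cliques with the basic Bron-Kerbosch recursion."""
    cliques = []

    def expand(r, p, x):
        # r: current clique, p: candidates, x: already-processed vertices
        if not p and not x:
            cliques.append(frozenset(r))
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return cliques


def clique_percolation(adj, k=3):
    """Union maximal cliques of size >= k that share at least k-1 vertices.

    Assumption: this overlap rule is what makes two cliques "adjacent".
    """
    cliques = [c for c in maximal_cliques(adj) if len(c) >= k]
    parent = list(range(len(cliques)))

    def find(i):
        # Union-find root lookup with path halving
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(cliques)), 2):
        if len(cliques[i] & cliques[j]) >= k - 1:
            parent[find(i)] = find(j)

    clusters = {}
    for i, c in enumerate(cliques):
        clusters.setdefault(find(i), set()).update(c)
    return sorted(sorted(c) for c in clusters.values())


# Toy corpus of six documents; document 3 belongs to both groups.
edges = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3), (3, 4), (3, 5), (4, 5)]
adj = build_graph(6, edges)
print(clique_percolation(adj, k=3))  # [[0, 1, 2, 3], [3, 4, 5]]
```

In this toy graph the cliques {0, 1, 2} and {1, 2, 3} share two vertices and merge, while {3, 4, 5} shares only vertex 3 with them and stays separate, so document 3 ends up in both clusters. Neither the number of clusters nor their sizes are fixed in advance, which is the sense in which the clusters are allowed to overlap naturally.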
