Preserving Patterns in Bipartite Graph Partitioning

This paper describes a new bipartite graph formulation for word-document co-clustering in which hyperclique patterns, here sets of strongly affiliated documents, are guaranteed not to be split across clusters. Our approach to pattern-preserving clustering consists of three steps: mine maximal hyperclique patterns, form the bipartite graph, and partition it. Because the hyperclique patterns of documents are preserved, the topic of each cluster can be represented both by the top words from that cluster and by the documents in its patterns, which are expected to be more compact and representative than those obtained from the standard bipartite formulation. Experiments on real-world datasets show that using hyperclique patterns as starting points improves the clustering results under several external clustering criteria. In addition, the partitioned bipartite graph, with its preserved topical sets of documents, naturally lends itself to different functions in search engines.
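
The first two steps can be illustrated with a minimal sketch on toy data. It assumes that documents play the role of items and words the role of transactions, so the h-confidence of a document set is the number of words shared by all its documents divided by the largest vocabulary among them. It also assumes that a pattern is preserved by collapsing its documents into a single vertex, which is one simple way to make a split impossible and is not necessarily the construction used in the paper. The function names, the toy corpus, and the threshold of 0.6 are all hypothetical, and the brute-force enumeration is only practical for very small inputs.

```python
from collections import defaultdict
from itertools import combinations


def h_confidence(pattern, doc_words):
    """Words shared by every document, divided by the largest vocabulary."""
    shared = set.intersection(*(doc_words[d] for d in pattern))
    largest = max(len(doc_words[d]) for d in pattern)
    return len(shared) / largest if largest else 0.0


def maximal_hypercliques(doc_words, threshold):
    """Brute-force search; keeps only patterns with no qualifying superset."""
    docs = sorted(doc_words)
    found = [frozenset(c)
             for size in range(2, len(docs) + 1)
             for c in combinations(docs, size)
             if h_confidence(c, doc_words) >= threshold]
    return [p for p in found if not any(p < q for q in found)]


def collapse(doc_words, patterns):
    """Merge each pattern into one vertex and return weighted bipartite edges.

    Assumption: overlapping patterns are merged transitively (union-find).
    The weight of an edge (vertex, word) counts the member documents that
    contain the word.
    """
    parent = {d: d for d in doc_words}

    def find(d):
        while parent[d] != d:
            parent[d] = parent[parent[d]]
            d = parent[d]
        return d

    for p in patterns:
        first, *rest = sorted(p)
        for d in rest:
            parent[find(d)] = find(first)

    groups = defaultdict(list)
    for d in sorted(doc_words):
        groups[find(d)].append(d)

    edges = defaultdict(int)
    for members in groups.values():
        node = tuple(members)
        for d in members:
            for w in doc_words[d]:
                edges[(node, w)] += 1
    return dict(edges)


# Hypothetical toy corpus: document id -> set of words it contains.
doc_words = {
    "d1": {"graph", "cut", "spectral"},
    "d2": {"graph", "cut", "spectral", "matrix"},
    "d3": {"query", "search", "engine"},
    "d4": {"query", "search", "log"},
    "d5": {"graph", "query"},
}
patterns = maximal_hypercliques(doc_words, threshold=0.6)
edges = collapse(doc_words, patterns)
```

The resulting `edges` dictionary describes a bipartite graph between document vertices and words that any graph partitioner could then cut. Since each pattern occupies a single vertex, no partition can separate its documents.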
