Hierarchical document clustering using local patterns

The global pattern mining step in existing pattern-based hierarchical clustering algorithms may produce an unpredictable number of patterns. In this paper, we propose IDHC, a pattern-based hierarchical clustering algorithm that builds a cluster hierarchy without mining for globally significant patterns. IDHC first discovers locally promising patterns by allowing each instance to “vote” for its representative size-2 patterns in a way that balances local pattern frequency against pattern significance in the dataset. The cluster hierarchy (i.e., the global model) is then constructed directly, using these locally promising patterns as features. Each pattern forms an initial (possibly overlapping) cluster, and the rest of the cluster hierarchy is obtained through an iterative cluster refinement process. By exploiting instance-to-cluster relationships, this process directly identifies the clusters for each level of the hierarchy and efficiently prunes duplicate clusters. Furthermore, IDHC produces more descriptive cluster labels, because patterns are not artificially restricted, and adopts a soft clustering scheme that allows instances to exist in suitable nodes at various levels of the cluster hierarchy. We present results of experiments performed on 16 standard text datasets, and show that IDHC outperforms state-of-the-art hierarchical clustering algorithms in terms of average entropy and FScore measures.
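The voting step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the scoring function, so the score used here (the smaller of the two term counts in the document, multiplied by a lift-style significance over the whole dataset) is an assumption, as are the function name `vote_for_patterns`, the parameter `k`, and the choice to place a document in the cluster of every pattern it voted for. The iterative refinement that builds the rest of the hierarchy is not shown.

```python
from collections import Counter, defaultdict
from itertools import combinations


def vote_for_patterns(docs, k=2):
    """Let each document vote for its top-k size-2 patterns.

    docs: list of token lists (one list per document).
    Returns {pattern: set of document ids}; each entry is one initial,
    possibly overlapping, cluster. The scoring below is an assumed
    stand-in for the paper's balance of local frequency and significance.
    """
    n = len(docs)
    term_support = Counter()  # number of documents containing each term
    pair_support = Counter()  # number of documents containing each pair
    for doc in docs:
        terms = sorted(set(doc))
        term_support.update(terms)
        pair_support.update(combinations(terms, 2))

    clusters = defaultdict(set)
    for doc_id, doc in enumerate(docs):
        counts = Counter(doc)
        scored = []
        for a, b in combinations(sorted(counts), 2):
            local = min(counts[a], counts[b])  # local pattern frequency
            # Assumed significance measure: lift of the pair in the dataset.
            lift = pair_support[(a, b)] * n / (term_support[a] * term_support[b])
            scored.append((local * lift, (a, b)))
        # Highest score first; ties broken alphabetically for determinism.
        scored.sort(key=lambda item: (-item[0], item[1]))
        for _, pattern in scored[:k]:
            clusters[pattern].add(doc_id)
    return dict(clusters)


docs = [
    ["apple", "apple", "banana", "banana", "cat"],
    ["apple", "banana", "dog"],
    ["cat", "dog", "dog", "cat"],
]
initial_clusters = vote_for_patterns(docs, k=1)
```

With `k=1` each document lands in exactly one initial cluster; raising `k` lets a document vote for several patterns, which is where overlapping clusters come from in this sketch.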
