Classification by Pattern-based Hierarchical Clustering

In this paper, we propose CPHC, a semi-supervised classification algorithm that uses a pattern-based cluster hierarchy as a direct means for classification. All training and test instances are first clustered together using an instance-driven pattern-based hierarchical clustering algorithm that allows each instance to "vote" for its representative size-2 patterns in a way that balances local pattern significance and global pattern interestingness. These patterns form the initial clusters, and the rest of the cluster hierarchy is obtained through an iterative cluster refinement process that exploits local information. The resulting cluster hierarchy is then used directly to classify test instances, eliminating the need to train a classifier on an enhanced training set. For each test instance, we first use the hierarchical structure to identify the nodes that contain it, and then use the labels of the training instances in those nodes, weighting them in proportion to their pattern lengths, to obtain the most likely class label(s) for the test instance. In addition, CPHC increases the chances of classifying isolated test instances by inducing a type of feature transitivity. Results of experiments performed on 19 standard text and machine learning datasets show that CPHC outperforms a number of existing classification algorithms even with sparse (as low as 1%) training data.
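The label-assignment step described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `classify`, the flat list-of-nodes representation, and the use of raw pattern length as the vote weight are all assumptions made for illustration, and the clustering, refinement, and feature-transitivity steps are omitted entirely.

```python
from collections import defaultdict

def classify(test_id, nodes, labels):
    """Return the top-scoring class label(s) for one test instance.

    nodes  : list of (pattern, member_ids) pairs, one per hierarchy node
             (hypothetical flat representation of the cluster hierarchy)
    labels : dict mapping training instance id -> class label
    """
    scores = defaultdict(float)
    for pattern, members in nodes:
        if test_id not in members:
            continue  # only nodes containing the test instance may vote
        weight = len(pattern)  # assumed weighting: vote strength = pattern length
        for inst in members:
            if inst in labels:  # unlabeled (test) instances do not vote
                scores[labels[inst]] += weight
    if not scores:
        return []  # isolated instance: no labeled neighbour in any node
    best = max(scores.values())
    return sorted(c for c, s in scores.items() if s == best)

# Toy data (invented for this example)
nodes = [
    (("ball", "goal"), {"d1", "d2", "t1"}),
    (("ball", "goal", "league"), {"d1", "t1"}),
    (("market", "stock"), {"d3", "t1"}),
]
labels = {"d1": "sports", "d2": "sports", "d3": "finance"}

print(classify("t1", nodes, labels))
```

In this toy data, "sports" collects votes from two nodes (weights 2 and 3) and "finance" from one node (weight 2), so longer, more specific patterns contribute more to the decision. An instance that appears in no node with labeled members gets an empty result, which is the case the abstract's feature transitivity is meant to reduce.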
