On the strength of hyperclique patterns for text categorization

The use of association patterns for text categorization has attracted great interest, and a variety of useful methods have been developed. However, the key characteristics of pattern-based text categorization remain unclear. In particular, two questions still lack concrete answers. First, what kind of association pattern is the best candidate for pattern-based text categorization? Second, what is the most desirable way to use patterns for text categorization? In this paper, we focus on answering these two questions. More specifically, we show that hyperclique patterns are more desirable than frequent patterns for text categorization. Along this line, we develop an algorithm for text categorization using hyperclique patterns. As demonstrated by our experimental results on various real-world text documents, our method provides much better computational performance than state-of-the-art methods while retaining classification accuracy.
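To make the contrast between frequent patterns and hyperclique patterns concrete, the following is a minimal sketch and not the algorithm developed in the paper. It assumes the usual definition from the hyperclique literature: the h-confidence of a term set P is supp(P) divided by the largest support of any single term in P, and P is a hyperclique pattern when both its support and its h-confidence meet user-chosen thresholds. The function names, the toy corpus, and the threshold values are all hypothetical and chosen only for illustration.

```python
from itertools import combinations


def support(itemset, docs):
    """Fraction of documents that contain every term in `itemset`."""
    items = set(itemset)
    return sum(1 for d in docs if items <= d) / len(docs)


def h_confidence(itemset, docs):
    """supp(P) divided by the largest single-term support within P."""
    max_single = max(support({t}, docs) for t in itemset)
    if max_single == 0:
        return 0.0
    return support(itemset, docs) / max_single


def frequent_pairs(docs, min_support):
    """All term pairs whose support is at least `min_support`."""
    vocab = sorted(set().union(*docs))
    return [p for p in combinations(vocab, 2)
            if support(p, docs) >= min_support]


def hyperclique_pairs(docs, min_support, min_hconf):
    """Frequent pairs that also meet the h-confidence threshold."""
    return [p for p in frequent_pairs(docs, min_support)
            if h_confidence(p, docs) >= min_hconf]


# Hypothetical toy corpus: each document is a set of terms.
# "the" appears everywhere, so it pairs frequently with many terms
# without being strongly associated with any of them.
docs = [
    {"stock", "market", "the"},
    {"stock", "market", "the"},
    {"the", "game"},
    {"the", "team", "game"},
    {"the"},
]

freq = frequent_pairs(docs, min_support=0.2)
hyper = hyperclique_pairs(docs, min_support=0.2, min_hconf=0.6)
```

On this toy corpus, the support threshold alone keeps pairs that include the ubiquitous term "the", whereas the h-confidence threshold filters them out because their support is small relative to the support of "the" itself. This is only meant to illustrate why a cross-support-aware measure can yield a smaller pattern set; it says nothing about the classification step.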
