Word Cloud Model for Text Categorization

In centroid-based classification, each class is represented by a prototype or centroid document vector that is formed by averaging all member vectors during the training phase. In the prediction phase, the label of a test document vector is assigned to that of its nearest class prototype. Recently there has been revived interest in reformulating the prototype/centroid to improve classification performance. In this paper, we study the theoretical properties of the recently proposed Class Feature Centroid (CFC) classifier by considering the rate of change of each prototype vector with respect to individual dimensions (terms). The implication of our theoretical finding is that CFC is inherently biased towards large (dominant majority) classes, which means it is destined to perform poorly for highly class-imbalanced data. Another practical concern about CFC lies in its overly-aggressive design in weeding out terms that appear in all classes. To overcome these CFC limitations while retaining its intrinsic and worthy design goals, we propose an improved and robust centroid-based classifier that uses precise term-class distribution properties instead of simple presence or absence of terms in classes. Specifically, terms are weighted based on the Kullback-Leibler divergence measure between pairs of class-conditional term probabilities, we call this the CFC-KL centroid classifier. We then generalized CFC-KL to handle multi-class data by summing pair wise class-conditioned word probability ratios. Our proposed approach has been evaluated on 5 datasets, on which it consistently outperforms CFC and the baseline Support Vector Machine classifier. We also devise a word cloud visualization approach to highlight the important class-specific words picked out by our CFC-KL, and visually compare it with other popular term weigthing approaches. Our encouraging results show that the centroid based generalized CFC-KL classifier is both robust and efficient to deal with real-world text classification.

[1]  Vasileios Hatzivassiloglou,et al.  Predicting the Semantic Orientation of Adjectives , 1997, ACL.

[2]  Pat Langley,et al.  An Analysis of Bayesian Classifiers , 1992, AAAI.

[3]  Fabrizio Sebastiani,et al.  Supervised term weighting for automated text categorization , 2003, SAC '03.

[4]  Jian Su,et al.  Supervised and Traditional Term Weighting Methods for Automatic Text Categorization , 2009, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[5]  Fabrizio Sebastiani,et al.  Machine learning in automated text categorization , 2001, CSUR.

[6]  Peter E. Hart,et al.  Nearest neighbor pattern classification , 1967, IEEE Trans. Inf. Theory.

[7]  Thorsten Joachims,et al.  A statistical learning learning model of text classification for support vector machines , 2001, SIGIR '01.

[8]  George Karypis,et al.  Centroid-Based Document Classification: Analysis and Experimental Results , 2000, PKDD.

[9]  Bo Pang,et al.  A Sentimental Education: Sentiment Analysis Using Subjectivity Summarization Based on Minimum Cuts , 2004, ACL.

[10]  Minyi Guo,et al.  A class-feature-centroid classifier for text categorization , 2009, WWW '09.

[11]  Janyce Wiebe,et al.  Recognizing subjectivity: a case study in manual tagging , 1999, Natural Language Engineering.

[12]  S. M. Ali,et al.  A General Class of Coefficients of Divergence of One Distribution from Another , 1966 .

[13]  Bing Liu,et al.  Mining and summarizing customer reviews , 2004, KDD.

[14]  Thorsten Joachims,et al.  A Statistical Learning Model of Text Classification for Support Vector Machines. , 2001, SIGIR 2002.

[15]  David D. Lewis,et al.  Naive (Bayes) at Forty: The Independence Assumption in Information Retrieval , 1998, ECML.

[16]  Ronald L. Rivest,et al.  Inferring Decision Trees Using the Minimum Description Length Principle , 1989, Inf. Comput..

[17]  Christopher D. Manning,et al.  Introduction to Information Retrieval , 2010, J. Assoc. Inf. Sci. Technol..

[18]  Timothy W. Finin,et al.  Improving binary classification on text problems using differential word features , 2009, CIKM.