A new Centroid-Based Classification model for text categorization

Automatic text categorization has gained significant attention among researchers because of the increasing availability of online text information, and many different learning approaches have been designed for it. Among them, the Centroid-Based Classifier (CBC) is widely used because of its theoretical simplicity and computational efficiency. However, the classification accuracy of CBC depends greatly on the data distribution: when the distribution is highly skewed, CBC yields a misfit model and classifies poorly. In this paper, a new classification model named the Gravitation Model (GM) is proposed to solve the class-imbalanced classification problem. In the training phase, each class is weighted by a mass factor, learned from the training data, that reflects the data distribution of the corresponding class. In the testing phase, a new document is assigned to the class that exerts the largest gravitational force on it. Experiments on twelve real datasets compare GM with CBC and its variants and show that the proposed gravitation model consistently outperforms both CBC and the Class-Feature-Centroid Classifier (CFC). It also achieves classification accuracy competitive with the DragPushing (DP) method while maintaining more stable performance. Thus, the proposed gravitation model is shown to be less prone to over-fitting and to have higher learning ability than the CBC model.
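The two phases described above can be illustrated with a minimal sketch. It is not the paper's formulation: the abstract does not give the force equation or the mass-learning rule, so the sketch assumes that the force is the class mass multiplied by the cosine similarity between document and centroid, and that masses are adjusted by a simple error-driven update. The names `fit_centroids`, `gravitational_force`, `predict`, and `learn_mass`, and the parameters `lr` and `epochs`, are hypothetical.

```python
import numpy as np


def fit_centroids(X, y, n_classes):
    """Return one L2-normalized centroid per class (rows of X are documents)."""
    centroids = np.zeros((n_classes, X.shape[1]))
    for c in range(n_classes):
        centroids[c] = X[y == c].mean(axis=0)
    norms = np.linalg.norm(centroids, axis=1, keepdims=True)
    return centroids / np.where(norms == 0, 1.0, norms)


def gravitational_force(X, centroids, mass):
    """ASSUMED force: class mass times cosine similarity to the class centroid."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    X_unit = X / np.where(norms == 0, 1.0, norms)
    return (X_unit @ centroids.T) * mass


def predict(X, centroids, mass):
    """Assign each document to the class exerting the largest force on it."""
    return np.argmax(gravitational_force(X, centroids, mass), axis=1)


def learn_mass(X, y, centroids, lr=0.05, epochs=20):
    """ASSUMED update rule: on each training error, raise the mass of the
    true class and lower the mass of the wrongly predicted class."""
    mass = np.ones(centroids.shape[0])
    for _ in range(epochs):
        predicted = predict(X, centroids, mass)
        for true_c, pred_c in zip(y, predicted):
            if true_c != pred_c:
                mass[true_c] += lr
                mass[pred_c] = max(mass[pred_c] - lr, 1e-6)
    return mass


# Toy usage: three documents, two classes, two term features.
X_train = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
y_train = np.array([0, 0, 1])
centroids = fit_centroids(X_train, y_train, n_classes=2)
mass = learn_mass(X_train, y_train, centroids)
labels = predict(X_train, centroids, mass)
```

With all masses equal to one, this sketch reduces to a plain cosine-similarity centroid classifier, which is the sense in which a mass factor acts as a per-class correction for skewed data.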
