Centroid Training to achieve effective text classification

Traditional text classification based on machine learning and data mining techniques has made considerable progress. However, drawing an exact decision boundary between relevant and irrelevant objects in binary classification remains difficult because of the uncertainty produced by traditional algorithms. The proposed model, CTTC (Centroid Training for Text Classification), builds an uncertainty boundary that absorbs as many indeterminate objects as possible, thereby increasing the certainty of the relevant and irrelevant groups through a centroid clustering and training process. Clustering starts from two training subsets, labelled relevant and irrelevant respectively, which yield two principal centroid vectors; these vectors separate all training samples into three groups, POS, NEG and BND, with the indeterminate objects absorbed into the uncertain decision boundary BND. Two pairs of centroid vectors are then trained and optimized through a subsequent iterative multi-learning process, and all of them collaborate to predict the polarity of each incoming object thereafter. For the assessment of the proposed model, F1 and Accuracy were chosen as the key evaluation measures, with emphasis on F1 because it reflects the overall performance improvement of the final classifier better than Accuracy. Extensive experiments were conducted on the Reuters Corpus Volume 1 (RCV1), an important standard dataset in the field. The results show that the proposed model significantly improves binary text classification performance in both F1 and Accuracy compared with three other influential baseline models.
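The three-way separation described above can be sketched in code. The abstract does not specify how a sample is assigned to POS, NEG or BND, so the sketch below assumes a simple cosine-similarity rule with a hypothetical `margin` threshold: a sample joins POS or NEG only when it is clearly closer to one centroid than the other, and falls into BND otherwise. The function names and the margin rule are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def centroid(vectors):
    # Principal centroid of a labelled group: the mean of its
    # document vectors (e.g. TF-IDF representations).
    return np.mean(vectors, axis=0)

def cosine(a, b):
    # Cosine similarity between two document vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def three_way_split(samples, pos_centroid, neg_centroid, margin=0.1):
    """Partition samples into POS, NEG and BND groups.

    A sample is assigned to POS (or NEG) only when its similarity to
    one centroid exceeds the other by at least `margin`; otherwise it
    is absorbed into the uncertain boundary BND. The `margin`
    parameter is a hypothetical stand-in for the paper's criterion.
    """
    pos, neg, bnd = [], [], []
    for x in samples:
        sim_pos = cosine(x, pos_centroid)
        sim_neg = cosine(x, neg_centroid)
        if sim_pos - sim_neg > margin:
            pos.append(x)
        elif sim_neg - sim_pos > margin:
            neg.append(x)
        else:
            bnd.append(x)
    return pos, neg, bnd
```

In this sketch, later training iterations could recompute centroids from the refined POS and NEG groups, which loosely mirrors the iterative multi-learning step the abstract mentions.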
