CCM: A Text Classification Model by Clustering

This paper presents a new Cluster-based Classification Model (CCM) for suspicious email detection and other text classification tasks. The proposed model is compared experimentally against traditional classification models and the boosting algorithm. The results show that CCM outperforms both the traditional classification models and the boosting algorithm at suspicious email detection on a terrorism-domain email dataset and at topic categorization on the Reuters-21578 and 20 Newsgroups datasets. The overall finding is that applying a cluster-based approach to text classification simplifies the model while also increasing accuracy.
