Text Classification Using Clustering

This paper addresses the problem of learning to classify texts by exploiting information derived from both the training and testing sets. To accomplish this, clustering is used as a complementary step to text classification and is applied not only to the training set but also to the testing set. This approach allows us to estimate the location of the testing examples and the structure of the whole dataset, which is not possible for an inductive learner that sees only the training set. Incorporating the knowledge obtained from clustering into the simple bag-of-words (BOW) representation of the texts is expected to boost the performance of a classifier. Experiments conducted on the tasks and datasets of the ECML/PKDD 2006 Discovery Challenge on personalized spam filtering demonstrate the effectiveness of the proposed approach. They show substantial improvements in classification performance, especially for small training sets.
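The idea of clustering the training and testing texts together and folding the result into the BOW representation can be illustrated with a minimal sketch. The abstract does not specify the clustering algorithm or the exact form of the added features, so everything below is an assumption made for illustration: plain k-means seeded with the first k documents, one indicator feature per cluster, and the hypothetical helper names `bow_vectors`, `kmeans`, and `augment_with_clusters`.

```python
from collections import Counter


def bow_vectors(docs):
    """Map each document to a term-count vector over a shared vocabulary."""
    vocab = sorted({w for d in docs for w in d.lower().split()})
    index = {w: i for i, w in enumerate(vocab)}
    vectors = []
    for d in docs:
        v = [0.0] * len(vocab)
        for w, c in Counter(d.lower().split()).items():
            v[index[w]] = float(c)
        vectors.append(v)
    return vectors


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k, iters=20):
    """Plain k-means (an assumed stand-in for the paper's clustering step).
    Seeded deterministically with the first k vectors."""
    centroids = [list(v) for v in vectors[:k]]
    assign = [0] * len(vectors)
    for _ in range(iters):
        assign = [min(range(k), key=lambda j: sq_dist(v, centroids[j]))
                  for v in vectors]
        for j in range(k):
            members = [v for v, a in zip(vectors, assign) if a == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = [sum(col) / len(members)
                                for col in zip(*members)]
    return assign


def augment_with_clusters(train_docs, test_docs, k):
    """Cluster training and testing documents together, then append one
    indicator feature per cluster to every BOW vector (assumed encoding)."""
    vectors = bow_vectors(train_docs + test_docs)
    assign = kmeans(vectors, k)
    augmented = [v + [1.0 if a == j else 0.0 for j in range(k)]
                 for v, a in zip(vectors, assign)]
    n = len(train_docs)
    return augmented[:n], augmented[n:]


train = ["cheap pills buy now", "meeting agenda for monday"]
test = ["buy cheap pills", "monday meeting notes"]
train_aug, test_aug = augment_with_clusters(train, test, k=2)
```

The augmented training vectors would then be passed to any standard classifier. Because the cluster features are computed over both sets, a test text that shares a cluster with a labeled training text carries that signal even when their vocabularies only partly overlap.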
