On Efficient Meta-Level Features for Effective Text Classification

This paper addresses the problem of automatically learning to classify texts by exploiting information derived from meta-level features (i.e., features derived from the original bag-of-words representation). We propose new meta-level features derived from the class distribution, the entropy, and the within-class cohesion observed in the k nearest neighbors of a given test document x, as well as from the distribution of distances of x to these neighbors. The set of proposed features is capable of transforming the original feature space into a new one, potentially smaller and more informed. Experiments performed with several standard datasets demonstrate that the effectiveness of the proposed meta-level features is not only far superior to the traditional bag-of-words representation but also superior to other state-of-the-art meta-level features previously proposed in the literature. Moreover, the proposed meta-features can be computed about three times faster than the existing meta-level ones, making our proposal much more scalable. We also demonstrate that the combination of our meta-features and the original set of features produces significant improvements when compared to each feature set used in isolation.
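To make the idea concrete, the following is a minimal sketch of how kNN-derived meta-level features of the kind the abstract describes could be computed for one test document. All function and variable names are illustrative, and the exact feature set (cohesion measures, distance statistics, weighting) is an assumption, not the paper's precise definition.

```python
import math
from collections import Counter

def meta_features(doc, train_docs, train_labels, k=3):
    """Illustrative kNN-based meta-level features for one test document.

    Feature groups (names and exact definitions are assumptions):
      - per-class proportion among the k nearest neighbors,
      - entropy of that neighbor class distribution,
      - mean and spread of the cosine distances to the neighbors.
    """
    def cosine_dist(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return 1.0 - dot / (na * nb) if na and nb else 1.0

    # Distances from the test document to every training document,
    # keeping only the k nearest neighbors.
    neighbors = sorted(
        (cosine_dist(doc, d), y) for d, y in zip(train_docs, train_labels)
    )[:k]

    classes = sorted(set(train_labels))
    counts = Counter(y for _, y in neighbors)

    # Class-distribution features: fraction of neighbors in each class.
    class_feats = [counts.get(c, 0) / k for c in classes]

    # Entropy of the neighbor class distribution (0 = pure neighborhood).
    entropy = -sum(p * math.log(p, 2) for p in class_feats if p > 0)

    # Distance-based features: mean neighbor distance and its spread.
    ds = [d for d, _ in neighbors]
    mean_d = sum(ds) / k
    spread = max(ds) - min(ds)

    return class_feats + [entropy, mean_d, spread]
```

A classifier would then be trained on these compact vectors (a handful of values per class plus a few distance statistics) instead of, or concatenated with, the high-dimensional bag-of-words vectors.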
