Random Subspace Method in Text Categorization

In text categorization (TC), a supervised learning task, documents are usually represented by feature vectors of terms or phrases. Because even a moderate-size text corpus contains a huge number of terms, a high-dimensional feature space is an intrinsic problem in TC. The random subspace method (RSM), a technique that divides the feature space into smaller subspaces, each submitted to a base classifier (BC) in an ensemble, can be an effective approach to reducing the dimensionality of the feature space. Inspired by similar research on functional magnetic resonance imaging (fMRI) of the brain, we address here the estimation of the ensemble parameters, i.e., the ensemble size (L) and the dimensionality of the feature subsets (M), by defining three criteria: usability, coverage, and diversity of the ensemble. We show that a moderate M and a small L yield an ensemble that improves on the performance of a single support vector machine, which is considered the state of the art in TC.
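
As a rough illustration of the RSM idea described above, the following Python sketch draws L random feature subsets of size M and combines one base classifier per subset by majority vote. It is a minimal sketch under stated assumptions, not the method evaluated here: the nearest-centroid base classifier is a stand-in chosen to keep the example dependency-free (the text compares against a support vector machine), all function names are hypothetical, and `feature_coverage` is only one plausible reading of the coverage criterion, which this abstract does not define.

```python
import random
from collections import Counter


def draw_subspaces(num_features, subset_size, ensemble_size, seed=0):
    """Draw `ensemble_size` (L) random feature-index subsets of size `subset_size` (M)."""
    rng = random.Random(seed)
    return [sorted(rng.sample(range(num_features), subset_size))
            for _ in range(ensemble_size)]


def feature_coverage(subspaces, num_features):
    """ASSUMPTION: fraction of original features used by at least one subset."""
    used = set().union(*subspaces)
    return len(used) / num_features


def train_centroids(X, y, features):
    """Toy base classifier (stand-in for an SVM): per-class mean over `features`."""
    sums, counts = {}, Counter(y)
    for row, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(features))
        for k, f in enumerate(features):
            acc[k] += row[f]
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}


def predict_one(centroids, features, row):
    """Assign the class whose centroid is nearest in squared Euclidean distance."""
    def dist(centroid):
        return sum((row[f] - centroid[k]) ** 2 for k, f in enumerate(features))
    return min(centroids, key=lambda label: dist(centroids[label]))


def rsm_predict(X_train, y_train, row, subspaces):
    """Majority vote over base classifiers, one trained per feature subset."""
    votes = Counter()
    for features in subspaces:
        centroids = train_centroids(X_train, y_train, features)
        votes[predict_one(centroids, features, row)] += 1
    return votes.most_common(1)[0][0]
```

A call such as `draw_subspaces(num_features=6, subset_size=3, ensemble_size=5)` yields the subsets that are then passed to `rsm_predict`. In a real TC setting the rows would be term-weight vectors, and M and L would be chosen with the usability, coverage, and diversity criteria rather than fixed by hand.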