Support vector machines for text categorization

The purpose of this research is to make the Support Vector Machine (SVM) a more effective method for text categorization. The research focuses on two aspects of the problem: better document representation for SVMs and improved SVM formulations for text categorization. In searching for a document representation better suited to SVMs, we first conduct a thorough empirical study of how various document representation methods influence the performance of two-class SVMs on text categorization. We then investigate the issues that arise when standard term-weighting document representation schemes are used with one-class SVMs; this not only yields better document representation for one-class SVMs but also reveals insights into the internal structure of binary classification problems derived from multi-class problems in one-versus-rest fashion. To improve SVMs for text categorization, we introduce a μν-SVM formulation that allows intuitive model selection for problems with highly imbalanced datasets, such as text categorization, greatly reducing the need for expensive cross-validation. We also motivate the weighted-margin SVM formulation, which makes predictions based on both the evidence embedded in the training examples and prior knowledge in the form of weak prediction rules. By integrating keyword-based weak prediction rules, the need for large labeled datasets can be greatly reduced. Experiments demonstrate the performance improvements from the enhancements to both the SVM formulations and the document representation.
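As a concrete illustration of the baseline setting this work builds on, the sketch below trains a standard two-class linear SVM on a term-weighted (TF-IDF) document representation, with a multi-class corpus cast into a one-versus-rest binary problem. This is only a minimal sketch of the conventional pipeline, not the μν-SVM or weighted-margin formulations proposed here; the corpus, the chosen target category, and the parameter values are illustrative assumptions.

```python
# Minimal sketch (assumes scikit-learn is available) of a standard
# TF-IDF + linear SVM text-categorization baseline. The corpus, the
# target category, and the C value are illustrative choices only.
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.metrics import f1_score

# Load a text corpus and cast it as a one-versus-rest binary problem:
# the target category is positive, every other category is negative.
train = fetch_20newsgroups(subset="train", remove=("headers", "footers", "quotes"))
test = fetch_20newsgroups(subset="test", remove=("headers", "footers", "quotes"))
target = train.target_names.index("sci.space")          # illustrative choice
y_train = (train.target == target).astype(int)
y_test = (test.target == target).astype(int)

# Term-weighting document representation: TF-IDF with sublinear term
# frequency, one of the standard schemes whose effect on SVM performance
# this kind of empirical study examines.
vectorizer = TfidfVectorizer(sublinear_tf=True, stop_words="english")
X_train = vectorizer.fit_transform(train.data)
X_test = vectorizer.transform(test.data)

# Two-class linear SVM; class_weight="balanced" is a simple stand-in for
# handling the heavy class imbalance that one-versus-rest splits create.
clf = LinearSVC(C=1.0, class_weight="balanced")
clf.fit(X_train, y_train)

print("F1 on the positive (rare) class:", f1_score(y_test, clf.predict(X_test)))
```

The one-versus-rest construction makes the positive class a small fraction of the training data, which is precisely the imbalance that motivates the intuitive model selection offered by the μν-SVM formulation.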