Using ambiguity measure feature selection algorithm for support vector machine classifier

With the ever-increasing number of documents on the web, in digital libraries, news sources, etc., the need for a text classifier that can handle massive amounts of data is becoming more critical, and the task more difficult. The major problem in text classification is the high dimensionality of the feature space. The Support Vector Machine (SVM) classifier has been shown to perform consistently better than other text classification algorithms; however, training an SVM model takes longer than training other classifiers. We explore the use of the Ambiguity Measure (AM) feature selection method, which uses only the most unambiguous keywords to predict the category of a document. Our analysis shows that AM reduces the training time by more than 50% compared to using no feature selection, while keeping the accuracy of the text classifier equivalent to or better than that obtained with the whole feature set. We empirically demonstrate the effectiveness of our approach, which outperforms seven different feature selection methods on two standard benchmark datasets.
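The abstract does not spell out how AM is computed. Below is a minimal sketch, assuming AM(t) is the largest fraction of a term's occurrences that fall within a single category, so that terms concentrated in one category are "unambiguous" and terms spread across categories are pruned before SVM training. The toy corpus, the 0.9 threshold, and the helper name `ambiguity_measure` are illustrative, not taken from the paper.

```python
# Sketch of Ambiguity Measure (AM) feature selection before SVM training.
# Assumption: AM(t) = max over categories c of tf(t, c) / tf(t); terms with
# AM below a cutoff are considered ambiguous and discarded.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

# Toy corpus standing in for a benchmark dataset such as Reuters or 20 Newsgroups.
docs = [
    "stock prices rise on strong earnings",
    "central bank raises interest rates",
    "team wins championship game in overtime",
    "star player scores winning goal",
]
labels = ["finance", "finance", "sports", "sports"]

def ambiguity_measure(X, y, vocab_size):
    """AM(t) = max_c tf(t, c) / tf(t), computed from a term-count matrix X."""
    categories = sorted(set(y))
    tf_per_cat = np.zeros((len(categories), vocab_size))
    for i, cat in enumerate(categories):
        rows = [j for j, lab in enumerate(y) if lab == cat]
        tf_per_cat[i] = np.asarray(X[rows].sum(axis=0)).ravel()
    total_tf = tf_per_cat.sum(axis=0)
    total_tf[total_tf == 0] = 1  # guard against division by zero
    return tf_per_cat.max(axis=0) / total_tf

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

am = ambiguity_measure(X, labels, X.shape[1])
threshold = 0.9  # illustrative cutoff; in practice tuned empirically
keep = np.where(am >= threshold)[0]

# Train the SVM only on the unambiguous features, shrinking the feature space.
clf = LinearSVC()
clf.fit(X[:, keep], labels)
print(clf.predict(X[:, keep]))
```

The intended effect is that the SVM sees a much smaller, more discriminative feature set, which is where the reported reduction in training time comes from.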
