Ambiguity measure feature-selection algorithm

With the increasing number of digital documents, the ability to classify those documents automatically, both efficiently and accurately, is becoming more critical and more difficult. One of the major problems in text classification is the high dimensionality of the feature space. We present the ambiguity measure (AM) feature-selection algorithm, which selects the most unambiguous features from the feature set. Unambiguous features are those whose presence in a document indicates with a strong degree of confidence that the document belongs to one specific category. We apply AM feature selection to a naïve Bayes text classifier and show that our approach outperforms eight existing feature-selection methods on five benchmark datasets, with a statistical significance of at least 95% confidence. The support vector machine (SVM) text classifier is known to perform consistently better than the naïve Bayes text classifier; its drawback, however, is the time complexity of training a model. We therefore further explore the effect of AM feature selection on an SVM text classifier. Our results indicate that the training time for the SVM algorithm can be reduced by more than 50% while still improving the accuracy of the text classifier. In these SVM experiments, our approach statistically significantly (99% confidence) outperforms eight existing feature-selection methods on four standard benchmark datasets.
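The abstract describes AM informally: a feature is unambiguous when its occurrences concentrate in a single category. A minimal sketch of a score in that spirit, assuming AM(t) is the maximum over categories of tf(t, c) / tf(t) (the function names and threshold below are illustrative, not from the paper):

```python
from collections import defaultdict

def ambiguity_scores(docs):
    """Score each term by how concentrated it is in one category.

    `docs` is a list of (tokens, category) pairs. For a term t, the score is
    max_c tf(t, c) / tf(t): near 1.0 means t occurs almost exclusively in one
    category (unambiguous); near 1/|categories| means t is spread evenly
    across categories (ambiguous).
    """
    term_cat = defaultdict(lambda: defaultdict(int))  # term -> category -> count
    term_total = defaultdict(int)                     # term -> total count
    for tokens, cat in docs:
        for t in tokens:
            term_cat[t][cat] += 1
            term_total[t] += 1
    return {t: max(cats.values()) / term_total[t]
            for t, cats in term_cat.items()}

def select_unambiguous(docs, threshold=0.9):
    """Keep only terms whose concentration score meets the threshold."""
    scores = ambiguity_scores(docs)
    return {t for t, s in scores.items() if s >= threshold}
```

Terms retained this way would then form the reduced feature set fed to the naïve Bayes or SVM classifier, which is where the reported training-time savings come from.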
