Feature Ranking for Text Classifiers

Feature selection based on feature ranking has received much attention from researchers in the field of text classification, chiefly because ranking methods are scalable, easy to use, and fast to compute. Compared with search-based feature selection methods such as wrappers and filters, however, they suffer from poor performance. This is linked to three major deficiencies: (i) the quality of a feature ranking is problem-dependent; (ii) ranking methods ignore term dependencies, including redundancy and correlation; and (iii) they usually fail on unbalanced data. When using feature ranking methods for dimensionality reduction, we should be aware of these drawbacks, which arise from how ranking measures operate. This thesis proposes a set of solutions that address these drawbacks and boost the performance of feature ranking.

First, an evaluation framework called feature meta-ranking is proposed for evaluating ranking measures. The framework is based on a newly proposed Differential Filter Level Performance (DFLP) measure; a sketch of the filter-level evaluation it builds on is given below. It is proved that, in the ideal case, the performance of a text classifier is a monotonic, non-decreasing function of the number of selected features. We then validate, theoretically and empirically, the effectiveness of DFLP as a meta-ranking measure for evaluating and comparing feature ranking methods. The meta-ranking framework is also examined on a stopword extraction problem: we use it to select an appropriate feature ranking measure for building domain-specific stoplists. The framework is evaluated with SVM and Rocchio text classifiers on six benchmark data sets. The meta-ranking results suggest that, in the search for a proper feature ranking measure, backward feature ranking is as important as forward ranking.

Second, we show that the destructive effect of term redundancy worsens as the feature ranking threshold is lowered, which implies that aggressive feature selection must be accompanied by effective redundancy reduction. To detect and handle term dependencies, an algorithm is proposed that extracts term dependency links using an information-theoretic inclusion index (a sketch of one such index also appears below). The dependency links are visualized by a tree structure called a term dependency tree. By grouping the nodes of the tree into two categories, hub nodes and link nodes, a heuristic algorithm handles the dependencies by merging or removing the link nodes. The proposed redundancy reduction method is evaluated with SVM and Rocchio classifiers on four benchmark data sets. According to the results, redundancy reduction is more effective for weak classifiers, since they are more sensitive to term redundancy. The results also suggest that, for ranking measures that compact the information into a small number of features, aggressive feature selection is not recommended.
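The following is a minimal sketch of the kind of filter-level evaluation that a measure such as DFLP builds on: a classifier is trained and scored at a series of filter levels (the top-k features under a given ranking measure), and the differentials of the resulting performance curve are inspected. The chi-square ranking, the macro-F1 scorer, and the use of successive differences are illustrative assumptions, not the thesis's exact protocol.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.metrics import f1_score
from sklearn.svm import LinearSVC

def filter_level_performance(X_tr, y_tr, X_te, y_te, levels, score_fn=chi2):
    """Macro-F1 of a linear SVM trained on the top-k ranked features,
    for each filter level k in `levels`. Assumes non-negative term
    features (counts or tf-idf), as required by chi2."""
    perfs = []
    for k in levels:
        sel = SelectKBest(score_fn, k=k).fit(X_tr, y_tr)
        clf = LinearSVC().fit(sel.transform(X_tr), y_tr)
        pred = clf.predict(sel.transform(X_te))
        perfs.append(f1_score(y_te, pred, average="macro"))
    return np.array(perfs)

def dflp(perfs):
    """One assumed reading of DFLP: successive differences of the
    filter-level performance curve (not the thesis's exact formula)."""
    return np.diff(perfs)
```

Under the monotonicity result above, an ideal ranking yields non-negative differentials; large negative differentials flag filter levels at which the ranking discards informative terms.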
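The next sketch shows one plausible, co-occurrence-based reading of an inclusion index and the directed dependency links it induces. The document-frequency formulation and the 0.8 threshold are assumptions for illustration; the thesis's information-theoretic index and its hub/link merging heuristic are not reproduced here.

```python
import numpy as np

def inclusion_links(X_bin, vocab, threshold=0.8):
    """Directed dependency links u -> v: the documents containing term u
    are (almost) a subset of those containing term v. X_bin is a binary
    n_docs x n_terms array; vocab maps a column index to its term."""
    df = X_bin.sum(axis=0)          # document frequency of each term
    co = X_bin.T @ X_bin            # co-document counts for term pairs
    links = []
    for u in range(X_bin.shape[1]):
        for v in range(X_bin.shape[1]):
            if u != v and df[u] > 0 and co[u, v] / df[u] >= threshold:
                links.append((vocab[u], vocab[v]))  # u is included in v
    return links
```

In this reading, terms that accumulate many incoming links act as hubs of the dependency tree, and the subsumed link terms are the natural candidates for merging into their hubs or for removal.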
Finally, to deal with class imbalance at the feature level, a local feature ranking scheme called the reverse discrimination approach is proposed; a sketch of per-class ranking in this spirit closes this section. The method is applied to a highly unbalanced social network discovery problem. In this case study, the problem of learning a social network is translated into a text classification problem using newly proposed actor and relationship modeling. Since social networks are usually sparse structures, the corresponding text classification tasks are highly unbalanced. Experimental assessment of the reverse discrimination approach validates the effectiveness of local feature ranking in improving classifier performance on unbalanced data. The application itself suggests a new approach to learning social structures from textual data.
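Below is a minimal sketch of local (per-class) feature ranking in the spirit of the reverse discrimination approach: each class gets its own quota of top-ranked terms, so a minority class keeps its discriminative features instead of being crowded out of a single global ranking. The one-vs-rest chi-square scoring and the equal per-class quota are illustrative assumptions, not the thesis's exact scheme.

```python
import numpy as np
from sklearn.feature_selection import chi2

def local_feature_ranking(X, y, per_class_k):
    """Return the union of the top per_class_k features ranked separately
    for each class (one-vs-rest), instead of one global ranking."""
    selected = set()
    for c in np.unique(y):
        scores, _ = chi2(X, (y == c).astype(int))  # rank terms for class c only
        top = np.argsort(scores)[::-1][:per_class_k]
        selected.update(top.tolist())
    return sorted(selected)
```

A global ranking on unbalanced data tends to fill the budget with majority-class terms; the per-class quota guarantees the rare class a fixed share of the selected vocabulary.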
