OCFS: optimal orthogonal centroid feature selection for text categorization

Text categorization is an important research area in many Information Retrieval (IR) applications. To save storage space and computation time in text categorization, efficient and effective algorithms for reducing the data before analysis are highly desirable. Traditional techniques for this purpose generally fall into two groups: feature extraction and feature selection. Because of its efficiency, the latter is more suitable for text data such as web documents. However, many popular feature selection techniques, such as Information Gain (IG) and the χ²-test (CHI), are greedy in nature and thus may not be optimal with respect to a given criterion. Moreover, the performance of these greedy methods may deteriorate when the number of retained features is extremely small. In this paper, we propose an efficient optimal feature selection algorithm, called Orthogonal Centroid Feature Selection (OCFS), which optimizes the objective function of the Orthogonal Centroid (OC) subspace learning algorithm in a discrete solution space. Experiments on the 20 Newsgroups (20NG), Reuters Corpus Volume 1 (RCV1), and Open Directory Project (ODP) data sets show that OCFS is consistently better than IG and CHI, with lower computation time, especially when the reduced dimension is extremely small.
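
To make the idea concrete, below is a minimal sketch of the kind of centroid-based feature scoring the abstract describes: each feature is scored by how far the class centroids deviate from the global centroid along that feature, weighted by class size, and the top-scoring features are kept. This is our reading of the approach under stated assumptions (dense term-frequency matrix, this particular per-feature score), not code from the paper; the function names are hypothetical.

```python
import numpy as np

def centroid_feature_scores(X, y):
    """Score each feature by the class-size-weighted squared deviation
    of its class centroids from the global centroid.
    X: (n_docs, n_features) dense term matrix; y: (n_docs,) class labels."""
    n = X.shape[0]
    m = X.mean(axis=0)                       # global centroid
    scores = np.zeros(X.shape[1])
    for c in np.unique(y):
        Xc = X[y == c]
        mc = Xc.mean(axis=0)                 # centroid of class c
        scores += (len(Xc) / n) * (mc - m) ** 2
    return scores

def select_features(X, y, k):
    """Return indices of the k highest-scoring features."""
    return np.argsort(centroid_feature_scores(X, y))[-k:][::-1]
```

Because the score decomposes per feature, selecting the top k features maximizes the criterion over the discrete solution space directly, in a single pass over the data, rather than greedily ranking features by a surrogate statistic as IG and CHI do.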
