Local Relevance Weighted Maximum Margin Criterion for Text Classification

Text classification is an important task in information retrieval and data mining. In the vector space model (VSM), a document is represented as a high-dimensional vector, and a feature extraction phase is usually needed to reduce its dimensionality. In this paper, we propose a feature extraction method named Local Relevance Weighted Maximum Margin Criterion (LRWMMC). It aims to learn a subspace in which, within the local region of each document, documents of the same class are as close as possible while documents of different classes are as far apart as possible. Furthermore, relevance is taken into account as a weight that determines the extent to which the documents will be projected. LRWMMC is able to find the low-dimensional manifold embedded in the high-dimensional ambient space. In addition, we generalize LRWMMC to a Reproducing Kernel Hilbert Space (RKHS), which can handle nonlinearity in the input space. We also generalize LRWMMC to tensor space, which suits a newer document representation, the tensor space model (TSM). To exploit the large amount of unlabeled documents, we also present a Semi-Supervised LRWMMC, which finds a projection inferred from both the labeled and the unlabeled samples. Finally, we present a fast algorithm based on QR decomposition that makes the proposed methods applicable to large-scale data sets. Encouraging experimental results on benchmark text classification data sets indicate that the proposed methods outperform many existing feature extraction methods for text classification.
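
The core idea of a locally weighted maximum margin criterion can be illustrated with a small sketch. This is not the paper's algorithm: it assumes cosine similarity as the relevance weight, takes each document's k most similar documents as its local region, and extracts a single projection direction by shifted power iteration. The function names (`cosine`, `local_margin_matrix`, `top_direction`) and the toy term-frequency vectors are hypothetical and chosen only for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity, used here as an assumed stand-in for relevance."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0

def local_margin_matrix(docs, labels, k):
    """Build S_b - S_w from each document's k most similar neighbours.

    Same-class neighbours contribute to the within-class scatter S_w
    (subtracted); different-class neighbours contribute to the
    between-class scatter S_b (added). Each term is weighted by relevance.
    """
    d = len(docs[0])
    M = [[0.0] * d for _ in range(d)]
    for i, xi in enumerate(docs):
        ranked = sorted(
            ((cosine(xi, xj), j) for j, xj in enumerate(docs) if j != i),
            reverse=True,
        )
        for w, j in ranked[:k]:
            diff = [a - b for a, b in zip(xi, docs[j])]
            sign = -1.0 if labels[i] == labels[j] else 1.0
            for p in range(d):
                for q in range(d):
                    M[p][q] += sign * w * diff[p] * diff[q]
    return M

def top_direction(M, iters=500):
    """Unit eigenvector for the largest algebraic eigenvalue of symmetric M."""
    d = len(M)
    # Gershgorin bound: M + shift*I is positive semidefinite, so power
    # iteration converges to the eigenvector of the largest eigenvalue of M.
    shift = max(sum(abs(v) for v in row) for row in M)
    v = [1.0 / (p + 1) for p in range(d)]  # deterministic, asymmetric start
    n0 = math.sqrt(sum(x * x for x in v))
    v = [x / n0 for x in v]
    for _ in range(iters):
        u = [sum(M[p][q] * v[q] for q in range(d)) + shift * v[p] for p in range(d)]
        norm = math.sqrt(sum(x * x for x in u))
        if norm == 0.0:
            break
        v = [x / norm for x in u]
    return v

# Toy term-frequency vectors: class 0 favours term 0, class 1 favours term 1,
# and term 2 is shared noise.
docs = [[3, 0, 1], [2, 0, 2], [3, 1, 1], [0, 3, 1], [0, 2, 2], [1, 3, 1]]
labels = [0, 0, 0, 1, 1, 1]

M = local_margin_matrix(docs, labels, k=3)
w = top_direction(M)
projected = [sum(a * b for a, b in zip(x, w)) for x in docs]
```

On this toy data the learned direction contrasts the first two terms and gives almost no weight to the shared third term, so the two classes land on opposite sides of zero after projection. A multi-dimensional subspace would need several leading eigenvectors of the same matrix, which this sketch does not compute.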
