Wikipedia-Based Kernels for Text Categorization

In recent years several models have been proposed for text categorization. Within this, one of the widely applied models is the vector space model (VSM), where independence between indexing terms, usually words, is assumed. Since training corpora sizes are relatively small - compared to ap infin what would be required for a realistic number of words - the generalization power of the learning algorithms is low. It is assumed that a bigger text corpus can boost the representation and hence the learning process. Based on the work of Gabrilovich and Markovitch [6], we incorporate Wikipedia articles into the system to give word distributional representation for documents. The extension with this new corpus causes dimensionality increase, therefore clustering of features is needed. We use latent semantic analysis (LSA), kernel principal component analysis (KPCA) and kernel canonical correlation analysis (KCCA) and present results for these experiments on the Reuters corpus.

[1]  H. Hotelling Relations Between Two Sets of Variates , 1936 .

[2]  Gerard Salton,et al.  A vector space model for automatic indexing , 1975, CACM.

[3]  T. Landauer,et al.  Indexing by Latent Semantic Analysis , 1990 .

[4]  Yiming Yang,et al.  A Comparative Study on Feature Selection in Text Categorization , 1997, ICML.

[5]  Bernhard Schölkopf,et al.  Nonlinear Component Analysis as a Kernel Eigenvalue Problem , 1998, Neural Computation.

[6]  Thorsten Joachims,et al.  Text Categorization with Support Vector Machines: Learning with Many Relevant Features , 1998, ECML.

[7]  Yiming Yang,et al.  Translingual Information Retrieval: Learning from Bilingual Corpora , 1998, Artif. Intell..

[8]  Alexander J. Smola,et al.  Learning with kernels , 1998 .

[9]  Rajeev Motwani,et al.  The PageRank Citation Ranking : Bringing Order to the Web , 1999, WWW 1999.

[10]  Vladimir N. Vapnik,et al.  The Nature of Statistical Learning Theory , 2000, Statistics for Engineering and Information Science.

[11]  Koby Crammer,et al.  A new family of online algorithms for category ranking , 2002, SIGIR '02.

[12]  Fabrizio Sebastiani,et al.  Machine learning in automated text categorization , 2001, CSUR.

[13]  Alessandro Moschitti,et al.  A Study on Optimal Parameter Tuning for Rocchio Text Classifier , 2003, ECIR.

[14]  Nello Cristianini,et al.  Latent Semantic Kernels , 2001, Journal of Intelligent Information Systems.

[15]  John Shawe-Taylor,et al.  Canonical Correlation Analysis: An Overview with Application to Learning Methods , 2004, Neural Computation.

[16]  L. Csató,et al.  Segmentation-based Feature Selection for Text Categorization , 2006 .

[17]  Evgeniy Gabrilovich,et al.  Computing Semantic Relatedness Using Wikipedia-based Explicit Semantic Analysis , 2007, IJCAI.