A Concept Vector Space Model for Semantic Kernels

Kernels are widely used in Natural Language Processing as similarity measures within inner-product-based learning methods such as the Support Vector Machine. The Vector Space Model (VSM) is extensively used to represent documents spatially, but it is a purely statistical representation. In this paper, we present a Concept Vector Space Model (CVSM) representation that uses linguistic prior knowledge to capture the meaning of documents. We also propose a linear kernel and a latent kernel for this space. The linear kernel takes advantage of linguistic concepts, whereas the latent kernel combines statistical and linguistic concepts: it uses latent concepts extracted by Latent Semantic Analysis (LSA) in the CVSM. The kernels were evaluated on a text categorization task in the biomedical domain using the Ohsumed corpus, which is well known for being difficult to categorize. The results show that the CVSM improves performance compared to the VSM.
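As a rough illustration of the two kernel types named above, the following sketch computes a linear kernel and an LSA-style latent kernel over documents represented in a concept space. It is not the paper's implementation: the function names, the toy `term_to_concept` mapping, and the use of a plain truncated SVD are all assumptions made for the example, and the abstract does not specify how concepts are extracted or weighted.

```python
import numpy as np

def concept_vectors(term_doc, term_to_concept):
    """Map term-space documents into a concept space.

    term_doc:        (n_terms, n_docs) term-frequency matrix.
    term_to_concept: (n_concepts, n_terms) mapping (hypothetical; a real
                     system would derive it from a linguistic resource).
    """
    return term_to_concept @ term_doc

def linear_kernel(docs_a, docs_b):
    """Inner product between document columns in the concept space."""
    return docs_a.T @ docs_b

def latent_kernel(concept_docs, k):
    """Inner product after projecting onto the top-k left singular vectors.

    This is a generic LSA-style projection, assumed here for illustration.
    """
    u, _, _ = np.linalg.svd(concept_docs, full_matrices=False)
    projected = u[:, :k].T @ concept_docs  # (k, n_docs)
    return projected.T @ projected

# Toy data: 4 terms, 3 documents, 2 concepts (all values are made up).
term_doc = np.array([[2, 0, 1],
                     [1, 0, 0],
                     [0, 3, 1],
                     [0, 1, 2]], dtype=float)
term_to_concept = np.array([[1, 1, 0, 0],
                            [0, 0, 1, 1]], dtype=float)

C = concept_vectors(term_doc, term_to_concept)
K_lin = linear_kernel(C, C)
K_lat = latent_kernel(C, k=1)
K_full = latent_kernel(C, k=2)
```

Either Gram matrix could then be passed to an SVM implementation that accepts precomputed kernels. Keeping all latent dimensions (`k=2` here) reproduces the linear kernel, so the latent kernel only differs when `k` is smaller than the rank of the concept matrix.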
