Non-linear correspondence analysis in text retrieval: a kernel view

Classical factorial treatments of word-document count matrices, such as Factorial Correspondence Analysis (FCA), Latent Semantic Indexing (LSI), and non-linear generalizations of FCA (NLCA), can be described in the framework of the kernels associated with Support Vector Machines (SVM). This paper lays out the relationships between these formalisms and demonstrates how textual pre-processing with a “power kernel” can improve document classification on the Reuters-21578 corpus relative to the classical FCA kernel.
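
To make the kernel reading of FCA concrete, here is a minimal sketch in Python with NumPy. It is not taken from the paper: the function name `power_ca_kernel`, the parameter `beta`, and the Box-Cox-style transform `(q**beta - 1) / beta` applied to the independence quotients are assumptions chosen for illustration, and the paper's exact definition of the power kernel may differ. With `beta = 1` the sketch reduces to the standard correspondence-analysis kernel, i.e. the Gram matrix of standardized residuals between documents.

```python
import numpy as np


def power_ca_kernel(counts, beta=1.0):
    """Document-by-document kernel from a (documents x terms) count matrix.

    Illustrative assumption, not the paper's definition: the independence
    quotients q are transformed by (q**beta - 1) / beta before the Gram
    matrix is formed. beta = 1 gives the classical correspondence-analysis
    kernel.
    """
    if beta <= 0:
        raise ValueError("beta must be strictly positive")
    counts = np.asarray(counts, dtype=float)
    p = counts / counts.sum()      # joint relative frequencies
    r = p.sum(axis=1)              # document (row) masses
    c = p.sum(axis=0)              # term (column) masses
    if np.any(r == 0) or np.any(c == 0):
        raise ValueError("empty documents or unused terms must be removed first")
    expected = np.outer(r, c)      # frequencies expected under independence
    q = p / expected               # independence quotients
    phi = (q ** beta - 1.0) / beta # power transform of the quotients
    features = np.sqrt(expected) * phi
    return features @ features.T   # Gram matrix, hence positive semi-definite


# Toy example: 3 documents, 3 terms.
counts = np.array([[4, 0, 1],
                   [1, 3, 0],
                   [0, 2, 5]])
K_classical = power_ca_kernel(counts, beta=1.0)
K_power = power_ca_kernel(counts, beta=0.5)
```

Because the result is an explicit Gram matrix, it can be handed to any SVM implementation that accepts a precomputed kernel. How `beta` should be chosen is an empirical question that this sketch does not address.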
