Paper Information - Inferring a Semantic Representation of Text via Cross-Language Correlation Analysis

The problem of learning a semantic representation of a text document from data is addressed, in the situation where a corpus of unlabeled paired documents is available, each pair being formed by a short English document and its French translation. This representation can then be used for any retrieval, categorization or clustering task, both in a standard and in a cross-lingual setting. By using kernel functions, in this case simple bag-of-words inner products, each part of the corpus is mapped to a high-dimensional space. The correlations between the two spaces are then learnt by using kernel Canonical Correlation Analysis. A set of directions is found in the first and in the second space that are maximally correlated. Since we assume the two representations are completely independent apart from the semantic content, any correlation between them should reflect some semantic similarity. Certain patterns of English words that relate to a specific meaning should correlate with certain patterns of French words corresponding to the same meaning, across the corpus. Using the semantic representation obtained in this way, we first demonstrate that the correlations detected between the two versions of the corpus are significantly higher than random, and hence that a representation based on such features does capture statistical patterns that should reflect semantic information. We then use this representation in both cross-language and single-language retrieval tasks, observing performance that is consistently and significantly superior to LSI on the same data.
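The pipeline described in the abstract (bag-of-words inner-product kernels on each language, followed by kernel Canonical Correlation Analysis) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the six-sentence toy corpus, the helper names `bow_matrix`, `centered_linear_kernel` and `kernel_cca`, the regularization parameter `kappa`, and the particular regularized variant used, which constrains the dual vectors through `(K + kappa*I)` and solves the problem with a singular value decomposition. The paper's own formulation and regularization may differ.

```python
import numpy as np

def bow_matrix(docs):
    """Return a (documents x vocabulary) term-count matrix."""
    vocab = sorted({w for d in docs for w in d.lower().split()})
    index = {w: j for j, w in enumerate(vocab)}
    X = np.zeros((len(docs), len(vocab)))
    for i, d in enumerate(docs):
        for w in d.lower().split():
            X[i, index[w]] += 1.0
    return X

def centered_linear_kernel(X):
    """Bag-of-words inner-product kernel, centered in feature space."""
    K = X @ X.T
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ K @ H

def kernel_cca(Kx, Ky, kappa=1.0, n_components=2):
    """Regularized kernel CCA (one common variant, assumed here).

    Maximizes alpha' Kx Ky beta subject to
    ||(Kx + kappa I) alpha|| = ||(Ky + kappa I) beta|| = 1,
    which reduces to an SVD of T = Rx^{-1} Kx Ky Ry^{-1}.
    """
    n = Kx.shape[0]
    Rx = Kx + kappa * np.eye(n)
    Ry = Ky + kappa * np.eye(n)
    T = np.linalg.solve(Rx, Kx @ Ky)   # Rx^{-1} Kx Ky
    T = np.linalg.solve(Ry, T.T).T     # ... times Ry^{-1} (Ry is symmetric)
    U, rho, Vt = np.linalg.svd(T)
    alpha = np.linalg.solve(Rx, U[:, :n_components])
    beta = np.linalg.solve(Ry, Vt[:n_components].T)
    return alpha, beta, rho[:n_components]

# Hypothetical toy paired corpus: each English sentence with its French translation.
english = ["the house is red", "the car is red", "the house is big",
           "the car is fast", "the dog is big", "the dog is fast"]
french = ["la maison est rouge", "la voiture est rouge", "la maison est grande",
          "la voiture est rapide", "le chien est grand", "le chien est rapide"]

Kx = centered_linear_kernel(bow_matrix(english))
Ky = centered_linear_kernel(bow_matrix(french))
alpha, beta, rho = kernel_cca(Kx, Ky, kappa=0.5, n_components=2)

# Semantic features: projections of each training document onto the
# correlated directions found in the English and French spaces.
px = Kx @ alpha
py = Ky @ beta
print("regularized canonical correlations:", rho)
```

In this sketch, `px` and `py` play the role of the language-independent semantic representation: a cross-language retrieval step would compare a row of `px` for an English query against the rows of `py` for the French documents. The value of `kappa` trades off fit against overfitting on such a tiny corpus and would need tuning on real data.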
