Paper information - Bilingual sentence matching using Kernel CCA


Matching samples between two data sets is a fundamental task in unsupervised learning. In this paper we propose an algorithm that uses the statistical dependency between the data sets to solve the matching problem in the general case where the samples in the two data sets have different feature representations. As a concrete example, we consider sentence-level alignment of a parallel corpus based on monolingual data. Statistical machine translation methods require multilingual text collections with sentence-level alignment. We show how statistical dependencies between the feature representations of partially aligned corpora (e.g., aligned at the paragraph level) can be used to learn sentence-level alignment in a data-driven way. Our novel matching algorithm, based on Kernel Canonical Correlation Analysis (KCCA), outperforms an earlier algorithm that uses linear CCA.
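The matching step can be illustrated with a small sketch. It is not the authors' algorithm: it assumes that sentences from both languages have already been projected into a shared space (for example by CCA or KCCA, which is not shown), and it simply searches for the one-to-one pairing with the highest total cosine similarity. The names `cosine` and `best_matching` and the toy vectors are hypothetical and chosen for illustration only. The search is brute force over all permutations, which is only feasible for a handful of samples; a polynomial-time assignment solver would be needed at realistic scale.

```python
from itertools import permutations
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def best_matching(X, Y):
    """Return a tuple p such that X[i] is paired with Y[p[i]].

    The pairing maximises the summed cosine similarity over all
    one-to-one assignments. Brute force: O(n!) in the number of samples.
    """
    n = len(X)
    if n != len(Y):
        raise ValueError("both sets must contain the same number of samples")
    sim = [[cosine(x, y) for y in Y] for x in X]
    return max(
        permutations(range(n)),
        key=lambda p: sum(sim[i][p[i]] for i in range(n)),
    )


# Toy data (assumption): rows are sentences already projected into a shared 2-D space.
X = [[1.0, 0.1], [0.0, 1.0], [-1.0, 0.2]]   # "language A" sentences
Y = [[0.1, 0.9], [-0.9, 0.1], [0.9, 0.0]]   # "language B" sentences, shuffled
matching = best_matching(X, Y)
print(matching)  # (2, 0, 1): X[0] pairs with Y[2], X[1] with Y[0], X[2] with Y[1]
```

In this toy case each sentence in `X` has one clearly closest counterpart in `Y`, so the recovered pairing simply undoes the shuffle.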

Sami Virpioja | Abhishek Tripathi | Arto Klami
