Cross-Lingual Text Fragment Alignment Using Divergence from Randomness

This paper describes an approach to automatically align fragments of texts of two documents in different languages. A text fragment is a list of continuous sentences and an aligned pair of fragments consists of two fragments in two documents, which are content-wise related. Cross-lingual similarity between fragments of texts is estimated based on models of divergence from randomness. A set of aligned fragments based on the similarity scores are selected to provide an alignment between sections of the two documents. Similarity measures based on divergence show strong performance in the context of cross-lingual fragment alignment in the performed experiments.

[1]  Daniel Marcu,et al.  A Phrase-Based HMM Approach to Document/Abstract Alignment , 2004, EMNLP.

[2]  Dragos Stefan Munteanu,et al.  Improving Machine Translation Performance by Exploiting Non-Parallel Corpora , 2005, CL.

[3]  Wojciech Rytter,et al.  Extracting Powers and Periods in a String from Its Runs Structure , 2010, SPIRE.

[4]  Philipp Koehn,et al.  Europarl: A Parallel Corpus for Statistical Machine Translation , 2005, MTSUMMIT.

[5]  C. J. van Rijsbergen,et al.  Probabilistic models of information retrieval based on measuring the divergence from randomness , 2002, TOIS.

[6]  Xiaoyi Ma,et al.  Champollion: A Robust Parallel Text Sentence Aligner , 2006, LREC.

[7]  Justin Zobel,et al.  Redundant documents and search effectiveness , 2005, CIKM '05.

[8]  Christof Monz,et al.  The QMUL system description for IWSLT 2010 , 2010, IWSLT.

[9]  Wai Lam,et al.  MEAD - A Platform for Multidocument Multilingual Text Summarization , 2004, LREC.

[10]  Bruno Pouliquen,et al.  Automatic Identification of Document Translations in Large Multilingual Document Collections , 2006, ArXiv.

[11]  Robert L. Mercer,et al.  The Mathematics of Statistical Machine Translation: Parameter Estimation , 1993, CL.

[12]  Dragos Stefan Munteanu,et al.  Extracting Parallel Sub-Sentential Fragments from Non-Parallel Corpora , 2006, ACL.

[13]  Jakob Uszkoreit,et al.  Large Scale Parallel Document Mining for Machine Translation , 2010, COLING.

[14]  Hermann Ney,et al.  Improved Alignment Models for Statistical Machine Translation , 1999, EMNLP.

[15]  Alberto Barrón-Cedeño,et al.  On Cross-lingual Plagiarism Analysis using a Statistical Model , 2008, PAN.

[16]  Justin Zobel,et al.  A Scalable System for Identifying Co-derivative Documents , 2004, SPIRE.