ACCURAT Toolkit for Multi-Level Alignment and Information Extraction from Comparable Corpora

The lack of parallel corpora and linguistic resources for many languages and domains is one of the major obstacles for the further advancement of automated translation. A possible solution is to exploit comparable corpora (non-parallel bi- or multi-lingual text resources) which are much more widely available than parallel translation data. Our presented toolkit deals with parallel content extraction from comparable corpora. It consists of tools bundled in two workflows: (1) alignment of comparable documents and extraction of parallel sentences and (2) extraction and bilingual mapping of terms and named entities. The toolkit pairs similar bilingual comparable documents and extracts parallel sentences and bilingual terminological and named entity dictionaries from comparable corpora. This demonstration focuses on the English, Latvian, Lithuanian, and Romanian languages.

[1]  Nikola Ljubešić,et al.  Term Extraction, Tagging, and Mapping Tools for Under-Resourced Languages , 2012 .

[2]  Marcis Pinnis Latvian and Lithuanian Named Entity Recognition with TildeNER , 2012, LREC.

[3]  Bogdan Babych,et al.  Measuring Comparability of Documents in Non-Parallel Corpora for Efficient Extraction of (Semi-)Parallel Translation Equivalents , 2012, ESIRMT/HyTra@EACL.

[4]  Inguna Skadina,et al.  A Collection of Comparable Corpora for Under-resourced Languages , 2010, Baltic HLT.

[5]  Reinhard Rapp,et al.  Identifying Word Translations in Non-Parallel Texts , 1995, ACL.

[6]  Holger Schwenk,et al.  Parallel sentence generation from comparable corpora for improved SMT , 2011, Machine Translation.

[7]  Emmanuel Morin,et al.  An Effective Compositional Model for Lexical Alignment , 2008, IJCNLP.

[8]  Dan Gusfield,et al.  Algorithms on Strings, Trees, and Sequences - Computer Science and Computational Biology , 1997 .

[9]  Miles Osborne,et al.  Statistical Machine Translation , 2010, Encyclopedia of Machine Learning and Data Mining.

[10]  Philipp Koehn,et al.  Findings of the 2010 Joint Workshop on Statistical Machine Translation and Metrics for Machine Translation , 2010, WMT@ACL.

[11]  Dan Gusfield,et al.  Algorithms on Strings, Trees, and Sequences - Computer Science and Computational Biology , 1997 .

[12]  Stephan Vogel,et al.  Parallel Implementations of Word Alignment Tool , 2008, SETQALNLP.

[13]  Hermann Ney,et al.  Improved Statistical Alignment Models , 2000, ACL.

[14]  Kristina Toutanova,et al.  Extracting Parallel Sentences from Comparable Corpora using Document Level Alignment , 2010, NAACL.

[15]  Daniel Marcu,et al.  Statistical Phrase-Based Translation , 2003, NAACL.

[16]  Emmanuel Morin,et al.  Bilingual Lexicon Extraction from Comparable Corpora Enhanced with Parallel Corpora , 2011, BUCC@ACL.

[17]  Benjamin K. Tsou,et al.  Building a Large English-Chinese Parallel Corpus from Comparable Patents and its Experimental Application to SMT , 2010 .

[18]  Bruno Pouliquen,et al.  Cross-Lingual Document Similarity Calculation Using the Multilingual Thesaurus EUROVOC , 2002, CICLing.

[19]  Dragos Stefan Munteanu,et al.  Extracting Parallel Sub-Sentential Fragments from Non-Parallel Corpora , 2006, ACL.

[20]  Sabine Hunsicker,et al.  Hybrid Parallel Sentence Mining from Comparable Corpora , 2012, EAMT.

[21]  Pascale Fung,et al.  Finding Terminology Translations from Non-parallel Corpora , 1997, VLC.

[22]  Bogdan Babych,et al.  Development and Application of a Cross-language Document Comparability Metric , 2012, LREC.

[23]  Holger Schwenk,et al.  On the Use of Comparable Corpora to Improve SMT performance , 2009, EACL.