Improving Bilingual Search Performance Using Compact Full-Text Indices

Machine translation tasks must cope with ever-larger parallel corpora, which calls for space- and time-efficient solutions. Several approaches based on full-text indices, such as suffix arrays, have achieved notable gains in both time and space. However, for bilingual tasks, the search time of such indices can be improved by adding an extra layer for the text alignment, and their space requirements can be significantly reduced by using more compact indices. We propose a search procedure built on top of a compact bilingual framework that improves bilingual search response time while keeping a space-efficient representation of aligned parallel corpora.
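
To make the idea of a full-text index with an alignment layer concrete, the following is a minimal sketch and not the procedure proposed here. It assumes a plain (uncompressed) word-level suffix array over the source side, a naive sort-based construction, and sentence-level alignment in which source sentence i is paired with target sentence i. The names `build_index`, `find_range`, and `bilingual_search` are hypothetical.

```python
from bisect import bisect_right

SEP = "\x00"  # assumed sentence separator; keeps matches inside one sentence


def build_index(source_sentences):
    """Build a word-level suffix array over whitespace-tokenized sentences."""
    tokens, starts = [], []
    for sent in source_sentences:
        starts.append(len(tokens))      # token offset where this sentence begins
        tokens.extend(sent.split())
        tokens.append(SEP)
    # Naive construction by sorting suffixes; adequate for a toy corpus only.
    sa = sorted(range(len(tokens)), key=lambda i: tokens[i:])
    return tokens, starts, sa


def find_range(tokens, sa, phrase):
    """Return the half-open suffix-array range of suffixes starting with phrase."""
    m = len(phrase)
    if m == 0:
        return 0, 0
    lo, hi = 0, len(sa)
    while lo < hi:                       # first suffix whose prefix is >= phrase
        mid = (lo + hi) // 2
        if tokens[sa[mid]:sa[mid] + m] < phrase:
            lo = mid + 1
        else:
            hi = mid
    first, hi = lo, len(sa)
    while lo < hi:                       # first suffix whose prefix is > phrase
        mid = (lo + hi) // 2
        if tokens[sa[mid]:sa[mid] + m] <= phrase:
            lo = mid + 1
        else:
            hi = mid
    return first, lo


def bilingual_search(index, target_sentences, query):
    """Find source sentences containing query and return their aligned targets."""
    tokens, starts, sa = index
    first, last = find_range(tokens, sa, query.split())
    # Alignment layer: map each token position back to its sentence number.
    sent_ids = sorted({bisect_right(starts, sa[k]) - 1 for k in range(first, last)})
    return [(i, target_sentences[i]) for i in sent_ids]


source = ["the house is small", "the house is big", "a small garden"]
target = ["a casa é pequena", "a casa é grande", "um jardim pequeno"]
index = build_index(source)
hits = bilingual_search(index, target, "the house")
```

In this sketch the alignment layer is only a sorted list of sentence offsets queried by binary search. A compact framework of the kind described above would replace both the suffix array and this offset list with space-efficient structures.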
