Parallel-Wiki: A Collection of Parallel Sentences Extracted from Wikipedia

Parallel corpora are essential resources for certain Natural Language Processing tasks, most notably Statistical Machine Translation. However, the existing publicly available parallel corpora are specific to narrow genres or domains, mostly juridical (e.g. JRC-Acquis) and medical (e.g. EMEA), and such resources are lacking for the general domain. This paper addresses this issue and presents a collection of parallel sentences extracted from the entire Wikipedia collection of documents for the following language pairs: English-German, English-Romanian and English-Spanish. Our work began with the processing of the publicly available Wikipedia static dumps for the three languages involved. The existing text was stripped of Wikipedia-specific mark-up, cleaned of non-textual entries such as images and tables, and sentence-split. Then, corresponding documents for the above-mentioned language pairs were identified using the cross-lingual Wikipedia links embedded within the documents themselves. Treating these as comparable documents, we then employed LEXACC, a publicly available tool developed during the ACCURAT project, to extract parallel sentences from the preprocessed data. LEXACC assigns each extracted pair a score that measures the degree of parallelism between the two sentences. These scores allow researchers to select only those sentence pairs whose degree of parallelism suits their intended purposes. This resource is publicly available at: http://ws.racai.ro:9191/repository/search/?q=Parallel+Wiki
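To illustrate how the per-pair scores can be used, the sketch below filters an extracted file down to pairs at or above a chosen parallelism threshold. It is a minimal Python sketch under assumed conventions: the file name, the tab-separated layout (score, source sentence, target sentence per line), and the threshold values are illustrative, not the released format.

    # filter_pairs.py -- keep only sentence pairs whose parallelism
    # score (as assigned by LEXACC) meets a chosen threshold.
    # ASSUMPTION: one pair per line as "score<TAB>source<TAB>target";
    # the actual Parallel-Wiki release layout may differ.
    def filter_parallel_pairs(path, threshold=0.5):
        with open(path, encoding="utf-8") as f:
            for line in f:
                score, source, target = line.rstrip("\n").split("\t", 2)
                if float(score) >= threshold:
                    yield source, target

    # Example: keep only strongly parallel English-Romanian pairs
    # (the file name and the 0.7 cut-off are hypothetical).
    for src, tgt in filter_parallel_pairs("en-ro.parallelwiki.tsv", 0.7):
        print(src, tgt, sep="\t")

A higher threshold trades recall for precision: scores near 1.0 select near-literal translations suited to SMT training, while lower scores admit looser, merely comparable pairs.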
