Producing Monolingual and Parallel Web Corpora at the Same Time - SpiderLing and Bitextor’s Love Affair

This paper presents an approach for building large monolingual corpora and, at the same time, extracting parallel data by crawling the top-level domain associated with a given language of interest. To gather linguistically relevant data from top-level domains, we use the SpiderLing crawler, modified to crawl data written in multiple languages. The output of this process is then fed to Bitextor, a tool for harvesting parallel data from a collection of documents. We call the system combining these two tools Spidextor, a blend of the names of its two crucial parts. We evaluate the approach in two ways: intrinsically, by measuring the accuracy of the bitexts extracted from the Croatian top-level domain “.hr” and the Slovene top-level domain “.si”, and extrinsically, on the English-Croatian language pair, by comparing a statistical machine translation (SMT) system built from the crawled data with third-party systems. Finally, we present the parallel datasets collected with our approach for the English-Croatian, English-Finnish, English-Serbian and English-Slovene language pairs.
