Producing Monolingual and Parallel Web Corpora at the Same Time - SpiderLing and Bitextor’s Love Affair

This paper presents an approach for building large monolingual corpora and, at the same time, extracting parallel data by crawling the top-level domain of a given language of interest. For gathering linguistically relevant data from top-level domains we use the SpiderLing crawler, modified to crawl data written in multiple languages. The output of this process is then fed to Bitextor, a tool for harvesting parallel data from a collection of documents. We call the system combining these two tools Spidextor, a blend of the names of its two crucial parts. We evaluate the described approach intrinsically by measuring the accuracy of the extracted bitexts from the Croatian top-level domain “.hr” and the Slovene top-level domain “.si”, and extrinsically on the English-Croatian language pair by comparing an SMT system built from the crawled data with third-party systems. We finally present parallel datasets collected with our approach for the English-Croatian, English-Finnish, English-Serbian and English-Slovene language pairs.
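
As a rough illustration of the hand-off described above, the following Python sketch groups crawled documents by website and keeps only the sites that contain text in both languages of a pair, since only those sites can yield parallel documents. The `CrawledDoc` record, the `bilingual_sites` function and the sample URLs are hypothetical. This is not the SpiderLing or Bitextor interface, only a sketch of the idea under those assumptions.

```python
from collections import defaultdict
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass(frozen=True)
class CrawledDoc:
    """Hypothetical record for one crawled page (not a SpiderLing type)."""
    url: str
    lang: str  # language code assigned by some language identifier
    text: str


def bilingual_sites(docs, lang_a, lang_b):
    """Group documents by host and keep hosts with text in both languages."""
    by_site = defaultdict(lambda: defaultdict(list))
    for doc in docs:
        host = urlparse(doc.url).netloc.lower()
        by_site[host][doc.lang].append(doc)
    # Only hosts that have at least one document in each language are
    # candidates for document alignment.
    return {
        host: {lang_a: langs[lang_a], lang_b: langs[lang_b]}
        for host, langs in by_site.items()
        if lang_a in langs and lang_b in langs
    }


# Made-up sample crawl of a ".hr" domain.
docs = [
    CrawledDoc("http://example.hr/o-nama", "hr", "O nama"),
    CrawledDoc("http://example.hr/en/about", "en", "About us"),
    CrawledDoc("http://vijesti.example.hr/1", "hr", "Vijesti"),
]
candidates = bilingual_sites(docs, "en", "hr")
print(sorted(candidates))
```

Monolingual sites such as the second host in the sample still contribute to the monolingual corpus; in this sketch they are simply not passed on as bitext candidates.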

Antonio Toral | Nikola Ljubesic | Miquel Esplà-Gomis | Filip Klubicka | Sergio Ortiz-Rojas