AUTOMATIC ACQUISITION OF BILINGUAL LANGUAGE RESOURCES

This paper discusses methods for the automatic acquisition of bilingual corpora from the Web. Given the vast number of documents available online, the Web can be considered an excellent source of data for linguistic purposes. Methods for creating such corpora are therefore of great value, especially when targeting less-resourced languages such as Greek. Besides presenting a general workflow for constructing collections from the Web, this paper describes our work on producing collections of English/Greek comparable documents in the “Political News”, “Technological News”, “Sport News”, and “Renewable Energy” domains, and parallel resources in the “Environment” and “Labour Legislation” domains.
