The case of InterCorp, a multilingual parallel corpus

This paper introduces InterCorp, a parallel corpus of texts in Czech and 27 other languages, searchable online through a web interface. After discussing some issues and merits of a multilingual resource, we argue that such a resource is especially important for languages with fewer native speakers, where it supports both comparative research and studies of the language from the perspective of other languages. We then give an overview of the corpus: the strategy and criteria for including new texts, the representation of available languages and text types, linguistic annotation, and a sketch of pre-processing issues. Finally, we present the search interface and suggest some research opportunities.
