HindEnCorp - Hindi-English and Hindi-only Corpus for Machine Translation

We present HindEnCorp, a parallel corpus of Hindi and English, and HindMonoCorp, a monolingual corpus of Hindi, both in release version 0.5. Both corpora were collected from web sources and preprocessed primarily for training statistical machine translation systems. HindEnCorp consists of 274k parallel sentences (3.9 million Hindi and 3.8 million English tokens). HindMonoCorp amounts to 787 million tokens in 44 million sentences. Both corpora are freely available for non-commercial research, and their preliminary release has been used by numerous participants in the WMT 2014 shared translation task.
