CzEng 0.9: Large Parallel Treebank with Rich Annotation

We describe our ongoing efforts in collecting CzEng, a Czech-English parallel corpus. The paper provides full details on the current version 0.9 and focuses on its new features: (1) data from new sources were added, most importantly a few hundred electronically available books, technical documentation, and some parallel web pages; (2) the full corpus has been automatically annotated up to the tectogrammatical layer (surface and deep syntactic analysis); (3) sentence segmentation has been refined; and (4) several heuristic filters to improve corpus quality were implemented. In total, we provide a sentence-aligned, automatically annotated parallel treebank of about 8.0 million sentences, 93 million English and 82 million Czech words. CzEng 0.9 is freely available for non-commercial research purposes.
