The MARCELL Legislative Corpus

This article presents the current outcomes of the MARCELL CEF Telecom project, which aims to collect and deeply annotate a large comparable corpus of legal documents. The MARCELL corpus includes seven monolingual sub-corpora (Bulgarian, Croatian, Hungarian, Polish, Romanian, Slovak and Slovenian), each containing the full body of the respective national legislative documents. These sub-corpora are automatically sentence-split, tokenized, lemmatized, and morphologically and syntactically annotated. The monolingual sub-corpora are complemented by a thematically related parallel corpus (Croatian-English). Metadata and annotations are provided uniformly for each language-specific sub-corpus. In addition to the standard morphosyntactic analysis, named entity annotation and dependency annotation, the corpus is enriched with IATE and EUROVOC labels. The files are in CoNLL-U Plus format, with the ten columns of the CoNLL-U format and four extra columns specific to our corpora. The MARCELL corpus is a rich and valuable resource for further studies and developments in machine learning, cross-lingual terminological data extraction and classification.
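As an illustration of the file format described above, the following minimal sketch reads a CoNLL-U Plus file in which the column layout is declared in a `# global.columns` header line, as the CoNLL-U Plus convention requires. The names of the four extra columns (`EX:NE`, `EX:NP`, `EX:IATE`, `EX:EUROVOC`), the sample tokens and the function `read_conllup` are hypothetical and chosen only for the example; the abstract does not name the extra columns, so the actual MARCELL files should be consulted for the real header.

```python
from typing import Dict, Iterable, Iterator, List

Token = Dict[str, str]


def read_conllup(lines: Iterable[str]) -> Iterator[List[Token]]:
    """Yield sentences as lists of {column name: value} dictionaries.

    The column names are taken from the '# global.columns = ...' header,
    so the reader does not hard-code any corpus-specific column.
    """
    columns = None
    sentence: List[Token] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("# global.columns ="):
            columns = line.split("=", 1)[1].split()
            continue
        if line.startswith("#"):
            # Other comment lines (sentence ids, text, metadata) are skipped here.
            continue
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        if columns is None:
            raise ValueError("missing '# global.columns' header")
        values = line.split("\t")
        if len(values) != len(columns):
            raise ValueError(f"expected {len(columns)} columns, got {len(values)}")
        sentence.append(dict(zip(columns, values)))
    if sentence:
        yield sentence


# Hypothetical sample: ten CoNLL-U columns plus four illustrative extra columns.
HEADER = ("# global.columns = ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC "
          "EX:NE EX:NP EX:IATE EX:EUROVOC")
ROWS = [
    ["1", "European", "European", "ADJ", "_", "_", "2", "amod", "_", "_",
     "B-ORG", "B-NP", "_", "_"],
    ["2", "Commission", "Commission", "PROPN", "_", "_", "0", "root", "_", "_",
     "I-ORG", "I-NP", "_", "_"],
]
SAMPLE = "\n".join([HEADER] + ["\t".join(row) for row in ROWS]) + "\n\n"

sentences = list(read_conllup(SAMPLE.splitlines()))
for token in sentences[0]:
    print(token["FORM"], token["UPOS"], token["EX:NE"])
```

Reading the column names from the header rather than fixing them in code keeps the same reader usable for plain CoNLL-U Plus files with a different set of extra columns.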
