Czech National Corpus in 2020: Recent Developments and Future Outlook

This paper gives an overview of the current state of the Czech National Corpus (CNC) in all the main areas of its operation: corpus compilation, annotation, application development and user services. Because the focus is on recent developments, some areas are described in more detail than others. Close attention is paid to data collection and, in particular, to web application development. This is not only because the CNC has recently made significant progress in this area, but also because we believe that end-user web applications shape the way linguists and other scholars think about language data and about the range of possibilities it offers. This consideration is even more important given the variety of the CNC corpora.
