Paper information - FicTree: A Manually Annotated Treebank of Czech Fiction

FicTree: A Manually Annotated Treebank of Czech Fiction

We present a manually annotated treebank of Czech fiction, intended to serve as an addendum to the Prague Dependency Treebank (PDT). The treebank has only 166,000 tokens, so it is not a good basis for training NLP tools on its own, but added to the PDT training data, it can help improve the annotation of fiction texts. We describe the composition of the corpus and the annotation process, including inter-annotator agreement. On the newly created data and the PDT data, we performed a number of experiments with parsers (TurboParser, Parsito, MSTParser and MaltParser). We observe that extending the PDT training data with a part of the new treebank does improve the parsing of literary texts. We also investigate cases where the parsers agree on an annotation that differs from the manual one.

Tomáš Jelínek
