Searching through Prague Dependency Treebank: Conception and Architecture

In this paper we present our conception of searching in syntactically annotated corpora. In the first part we briefly introduce the Prague Dependency Treebank, one of the key projects at our center. Its goal is to build a large corpus of Czech with a rich annotation scheme. We then describe the architecture and usage of the software system we developed for searching through the treebank. This tool works in the Internet environment and has a graphical, hardware-independent user interface. As theoretical background we also present a proof that the subtree counting problem is #P-complete.

1 The Prague Dependency Treebank

The Prague Dependency Treebank (PDT) ([1], [2], [3]) is a manually annotated corpus of Czech. The corpus contains approximately 1.5 million words. The texts are annotated on three layers:

- the morphological layer, where lemmas and tags are assigned to tokens based on their context
- the analytical layer, which roughly corresponds to the surface syntax of the sentence
- the tectogrammatical layer, which captures the linguistic meaning of the sentence in its context

On all three layers, every token in every sentence receives a unique annotation. Most of the annotation is done manually, using the necessary human judgment.

1.1 The Morphological Layer

The annotation at the morphological layer is an unstructured classification of the individual tokens (words and punctuation) of the utterance into morphological classes (morphological tags) and lemmas. The tagset contains 4257 tags, of which about 1100 actually appear in the PDT.
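To make the distinction between the size of the tagset and the number of tags actually appearing in the corpus concrete, the following is a minimal sketch of how morphologically annotated tokens might be represented and how the set of tags in use could be collected. The names `Token` and `distinct_tags`, the toy sentence, and the tag strings are all assumptions made for illustration; they are not the PDT data format or its real tags.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    """One token (word or punctuation mark) with its morphological annotation."""
    form: str   # surface form as it appears in the text
    lemma: str  # lemma assigned to this token in its context
    tag: str    # morphological tag assigned to this token in its context


def distinct_tags(sentences):
    """Return the set of tags that actually occur in the annotated sentences."""
    return {token.tag for sentence in sentences for token in sentence}


# Hypothetical toy data: the tag strings are placeholders, not real PDT tags.
corpus = [
    [
        Token("Praha", "Praha", "TAG_NOUN"),
        Token("je", "být", "TAG_VERB"),
        Token("město", "město", "TAG_NOUN"),
        Token(".", ".", "TAG_PUNCT"),
    ],
]

used = distinct_tags(corpus)
print(len(used))  # 3 distinct tags occur in this toy corpus
```

Run over a whole corpus, the same count corresponds to the "tags actually appearing" figure, which can be much smaller than the size of the full tagset.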