Issues of Projectivity in the Prague Dependency Treebank

In the present paper we discuss some issues connected with the condition of projectivity in a dependency-based description of language (see Sgall, Hajičová, and Panevová (1986), Hajičová, Partee, and Sgall (1998)), with special regard to the annotation scheme of the Prague Dependency Treebank (PDT, see Hajič (1998)). After a short Introduction (Section 1), Section 2 discusses the condition of projectivity in more detail: its formal definition and an algorithm for testing the condition on a subtree are presented in Section 2.1, the introduction of the condition of projectivity into a formal description of language is briefly substantiated in Section 2.2, and some problematic cases are discussed in Section 2.3. Section 3 presents a preliminary classification of Czech non-projective constructions on the analytical level into three main groups and several subgroups (Section 3.1), with illustrations of each subgroup in Section 3.2. A discussion of (surface) non-projectivities viewed from the perspective of the underlying (tectogrammatical) structures is given in Section 4; the classification outlined in Section 4.1 reflects the types of deviations from projectivity caused by topic-focus articulation (TFA). In Section 4.2 we examine the motivation and factors of non-projective constructions. The treatment of non-projective constructions in the annotation scenario of PDT is presented in Section 5. In the Conclusion (Section 6) we summarize the results and outline some directions for further research in this domain. The present contribution is an enlarged and slightly modified version of the paper Veselá, Havelka, and Hajičová (2004).

1 Condition of projectivity

The objective of the present paper is to analyze the property of projectivity, a condition formally defined by Marcus (1965) and postulated for dependency trees (see e.g. Kunze (1975); on projectivity on the tectogrammatical level of the Functional Generative Description (FGD), see e.g. Sgall, Hajičová, and Panevová (1986), pp. 238 ff.), in view of a complex multilevel account of language structure and, more specifically, as reflected in the multilayered annotation scenario of the Prague Dependency Treebank.

The Prague Dependency Treebank is a subset of texts taken from the Czech National Corpus (CNC). Each randomly chosen sample, consisting of 50 sentences of a coherent text, is annotated on three layers:

(i) the morphemic (POS) layer, with about 2000 tags for the highly inflectional Czech language;
(ii) the layer of ‘analytic’ (‘surface’) syntax (analytic representations, ARs in the sequel): about 100,000 Czech sentences, i.e. 2000 samples of 50 sentences of continuous text each, have been assigned dependency tree structures;
(iii) the tectogrammatical (underlying) syntactic layer: tectogrammatical tree structures (TGTSs) are assigned to a subset of the set annotated according to (ii); the current phase has resulted in 1000 samples of 50 sentences each.

The TGTSs are again based on dependency syntax, and the following principles are observed:

(a) only autosemantic (lexical) words have nodes of their own; function words, as far as they are semantically relevant, are reflected by parts of complex node labels (with the exception of coordinating conjunctions);
(b) nodes are added in case of deletions on the surface level;
(c) the condition of projectivity is met (i.e. no crossing of edges is allowed);
(d) tectogrammatical functions (‘functors’) such as Actor/Bearer, Patient, Addressee, Origin, Effect, and different kinds of Circumstantials are assigned;
(e) basic features of topic-focus articulation (TFA) are introduced;
(f) elementary coreference links (both grammatical and textual) are indicated.

A TGTS node label consists of:

(a) the lexical value of the word;
(b) its ‘(morphological) grammatemes’ (i.e. the values of morphological categories);
(c) its ‘functors’ (with a more subtle differentiation of syntactic relations by means of ‘syntactic grammatemes’, e.g. ‘in’, ‘at’, ‘on’, ‘under’);
(d) the attribute of Contextual Boundness (topic-focus articulation);
(e) values concerning intersentential links.

In Figure 1 we give a (rather simplified) illustrative example of a TGTS, which represents the preferred reading of sentence (1).
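The condition of projectivity required of TGTSs above (no crossing of edges) can be illustrated by a small test. The following is only a minimal sketch in Python, not the algorithm formulated in Section 2.1. It assumes the common formulation of projectivity under which every word lying between a governor and its dependent in the surface order must be subordinated to that governor. The function name is_projective and the encoding of a tree as a list of governor indices are illustrative assumptions and not part of the PDT format; the input is further assumed to be a well-formed tree (one root, no cycles).

```python
from typing import List, Optional


def is_projective(heads: List[Optional[int]]) -> bool:
    """Test a dependency tree for projectivity (illustrative sketch).

    heads[i] is the position of the governor of word i in the surface
    order, or None if word i is the root. This encoding is an assumption
    made for the example. The input must be a well-formed tree.
    """

    def subordinated_to(node: Optional[int], governor: int) -> bool:
        # Walk upwards from `node`; succeed if `governor` is reached.
        while node is not None:
            if node == governor:
                return True
            node = heads[node]
        return False

    for dependent, governor in enumerate(heads):
        if governor is None:
            continue  # the root has no incoming edge to check
        left, right = sorted((dependent, governor))
        # Every word strictly between the two ends of the edge must be
        # subordinated (directly or transitively) to the governor.
        for between in range(left + 1, right):
            if not subordinated_to(between, governor):
                return False
    return True
```

For instance, with four words in the order 0 1 2 3 and word 2 as the root, the governor list [2, 3, None, 2] yields a non-projective tree: the edge from word 3 to word 1 spans the root, which is not subordinated to word 3, so the edges 0-2 and 1-3 cross. The chain [None, 0, 1, 2], by contrast, is projective.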