Annotating and Querying a Treebank of Suboptimal Structures

Existing treebanks of written language, such as TIGER [2], TüBa-D/Z [11], and the Penn Treebank [1], usually consist of sentences that can be considered grammatically well-formed. The SINBAD treebank we present here covers a completely new domain, namely suboptimal syntactic structures, i.e., sentences which are neither fully grammatical nor completely ungrammatical, but merely suboptimal.1 The treebank consists of a collection of German sentences that are rated suboptimal or ungrammatical in the literature, as well as sentences drawn from our own experimental work on graded grammaticality judgments. In the literature, these structures are usually compared with grammatical structures which express the same meaning; for ease of comparison, these grammatical counterparts were in some cases included in the treebank as well. With this data collection we provide access to negative evidence, which does not occur in ordinary corpora of written or spoken language. It is characteristic of suboptimal structures that they are judged inconsistently: ratings vary between speakers and across contexts. It is therefore important to provide a systematic collection of these judgments, so that researchers have better access to past judgments on the phenomena they are interested in, which in turn contributes towards greater consistency, even in tricky cases. Since most work in syntactic theory is based on suboptimal or ungrammatical structures, the treebank aims at providing linguists with an empirical basis for their research. This requires a rich syntactic annotation with linguistically relevant concepts. The linguistic framework of the annotation is that of generative grammar, in the sense that the trees are strictly binary branching and contain traces and empty categories. The