Extracting raw material for a German subcategorization lexicon from newspaper text

This paper is about extracting evidence for syntactic subcategorization phenomena from German newspaper text. The purpose of this work is to support and partly automate the construction of a subcategorization lexicon for NLP, similar, for example, to COMLEX. We report here on the extraction of verb lists and sample sentences illustrating syntactic construction possibilities. The lists are ordered by subcategorization type; they are manually screened to remove noise and then used to automatically produce proto-entries of the lexicon. Since no phrasal parsing is yet available for German, we use part-of-speech shapes (a regular grammar over categorially and morphosyntactically annotated word forms) and lemma information; to reduce the noise produced by general part-of-speech shapes, we have defined "constraining contexts" and use context-dependent modeling. The retrieval results contain less than 5% noise. Moreover, we can retrieve specific types of syntactic information which cannot be found in any traditional dictionary: we can, for example, identify verbs with "obligatory coherent" infinitives (cf. [Haider 1993]). We explain the principles and procedures of our extraction work, discuss the case of infinitive-taking verbs and assess the results obtained on the first 3,000 readings extracted.
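
The idea of a part-of-speech shape can be illustrated with a minimal sketch. It assumes tokens are given as (word form, lemma, tag) triples and uses STTS-style tag names (VVFIN, PTKZU, VVINF); the token format, the tag names, the function names and the shape itself are assumptions made for this illustration, not the grammar used in the extraction work. The shape looks for a finite full verb followed, after non-verbal material only, by "zu" plus an infinitive; excluding verbal tags in between is a crude stand-in for a constraining context.

```python
import re

def encode(tokens):
    """Flatten (word, lemma, tag) triples into 'lemma/TAG' items separated by spaces."""
    return " ".join(f"{lemma}/{tag}" for _, lemma, tag in tokens)

# Hypothetical shape: finite full verb ... zu + infinitive.
ZU_INF_SHAPE = re.compile(
    r"(?<!\S)(?P<verb>[^\s/]+)/VVFIN"   # finite full verb: candidate governing verb
    r"(?: [^\s/]+/(?!V)[^\s/]+)*?"      # intervening tokens whose tag does not start with V
    r" zu/PTKZU [^\s/]+/VVINF"          # "zu" particle followed by an infinitive
)

def governing_verb(tokens):
    """Return the lemma of the verb matched by the shape, or None if there is no match."""
    match = ZU_INF_SHAPE.search(encode(tokens))
    return match.group("verb") if match else None

# "Er versucht das Buch zu lesen" (he tries to read the book)
with_infinitive = [
    ("Er", "er", "PPER"), ("versucht", "versuchen", "VVFIN"),
    ("das", "der", "ART"), ("Buch", "Buch", "NN"),
    ("zu", "zu", "PTKZU"), ("lesen", "lesen", "VVINF"),
]
# "Er liest das Buch" (he reads the book)
without_infinitive = [
    ("Er", "er", "PPER"), ("liest", "lesen", "VVFIN"),
    ("das", "der", "ART"), ("Buch", "Buch", "NN"),
]

print(governing_verb(with_infinitive))
print(governing_verb(without_infinitive))
```

Collecting the lemmas returned over a corpus would give a raw verb list for one subcategorization type, which would still need the manual screening described above.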