A Subcategorisation Lexicon for German Verbs induced from a Lexicalised PCFG

This paper presents a large-scale computational subcategorisation lexicon for several thousand German verbs. The lexical entries were obtained by unsupervised learning in a statistical grammar framework: a German context-free grammar containing frame-predicting grammar rules and information about lexical heads was trained on 18.7 million words of a large German newspaper corpus. We developed a simple methodology that uses the frequency distributions in the lexicalised version of the probabilistic grammar to induce syntactic verb frame descriptions. The frame definition can optionally include prepositional phrase refinement. An evaluation against a manually compiled dictionary supports the use of the machine-readable lexicon as a valuable component for NLP tasks. To our knowledge, no previous computational approach has obtained a subcategorisation lexicon for German that is comparable in size (the number of verbs in the lexicon), restriction (no limit on the frequencies of the verbs), or verified reliability (a successful, extensive evaluation against a dictionary).
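As a rough illustration of how frame descriptions might be induced from frequency distributions, the following Python sketch turns per-verb frame counts into a frame set by applying a relative-frequency cutoff. It is not the procedure used in the paper: the function name `induce_frames`, the cutoff value, the frame labels (`n`, `na`, `nad`, `np`), the `frame:preposition` notation for prepositional phrase refinement, and the counts are all assumptions made for the example.

```python
from collections import Counter


def induce_frames(frame_counts, cutoff=0.05, refine_pp=False):
    """Return the frames whose relative frequency reaches `cutoff`.

    `frame_counts` maps a frame label to its frequency for one verb.
    The label format "np:mit" (frame, colon, preposition) is a
    hypothetical convention for this sketch.
    """
    merged = Counter()
    for frame, count in frame_counts.items():
        # Without PP refinement, collapse e.g. "np:mit" into "np".
        key = frame if refine_pp else frame.split(":", 1)[0]
        merged[key] += count

    total = sum(merged.values())
    if total == 0:
        return {}

    # Keep only frames at or above the (assumed) cutoff.
    return {f: c / total for f, c in merged.items() if c / total >= cutoff}


# Invented counts for a single verb, for illustration only.
counts = {"na": 60, "n": 25, "np:mit": 10, "np:auf": 3, "nad": 2}

coarse = induce_frames(counts)                # PP frames merged into "np"
fine = induce_frames(counts, refine_pp=True)  # PP frames kept apart
```

With these invented counts, the coarse setting keeps `np` because the two prepositional variants are pooled, whereas the refined setting keeps only the variant that passes the cutoff on its own. This shows why the choice of frame definition affects which entries end up in the lexicon.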