Predicting the Components of German Nominal Compounds

Word prediction systems (such as those embedded in most current augmentative and alternative communication systems) aim to predict what a user wants to type next on the basis of corpus-extracted n-gram counts. Good performance of such a system depends crucially on the size and quality of the underlying lexicon. Compounding is a common cross-linguistic way to form complex words. In German, as in some other languages, compounds are commonly written as single orthographic strings. Because compounding is a very productive process, this results in a considerable number of orthographic words that cannot, even in principle, be listed in a lexicon. We present a solution to this problem based on the idea that compounds should not be predicted as units, but as the concatenation of their components. In particular, we designed a word prediction system in which the prediction of German two-element nominal compounds (by far the most common compound type in German) is split into the prediction of the modifier (left element) and the prediction of the head (right element). Both components are predicted on the basis of uni- and bigram statistics collected by treating modifiers and heads as independent units, and on the basis of the type frequency of nouns in head and modifier contexts in the training corpus. We show that our system brings a dramatic improvement in keystroke saving rate over a word prediction scheme in which compounds are treated as units. Moreover, our results indicate that the type frequency of nouns in head/modifier context in the training corpus is a very good predictor of which nouns will occur in head/modifier context in new text.
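
To make the split-prediction idea concrete, the sketch below is a minimal illustration, not the paper's implementation: the toy counts, function names, and the back-off weighting are assumptions introduced here. It shows how modifier candidates could be ranked from component-level unigram counts, and head candidates from modifier-head bigram counts with a fallback to head unigram counts.

```python
from collections import Counter

# Toy counts from a hypothetical training corpus in which compounds
# have been split into modifier + head (all values are illustrative).
modifier_counts = Counter({"Haus": 12, "Hand": 9, "Auto": 7})        # frequency as modifier
head_counts = Counter({"Tür": 10, "Schuh": 6, "Bahn": 5})            # frequency as head
modifier_head_bigrams = Counter({("Haus", "Tür"): 4, ("Hand", "Schuh"): 3})

def predict_modifiers(prefix, n=5):
    """Rank modifier candidates matching the typed prefix by their
    frequency as compound modifiers in the training data."""
    candidates = [(w, c) for w, c in modifier_counts.items() if w.startswith(prefix)]
    return sorted(candidates, key=lambda wc: wc[1], reverse=True)[:n]

def predict_heads(modifier, prefix="", n=5):
    """Rank head candidates for a chosen modifier, scoring modifier-head
    bigram evidence above head unigram evidence (weight is arbitrary)."""
    scored = {}
    for (mod, head), c in modifier_head_bigrams.items():
        if mod == modifier and head.startswith(prefix):
            scored[head] = scored.get(head, 0) + c * 10
    for head, c in head_counts.items():
        if head.startswith(prefix):
            scored[head] = scored.get(head, 0) + c
    return sorted(scored.items(), key=lambda wc: wc[1], reverse=True)[:n]

print(predict_modifiers("Ha"))     # e.g. [('Haus', 12), ('Hand', 9)]
print(predict_heads("Haus", "T"))  # e.g. [('Tür', 50)]
```

In this sketch, the two prediction steps are chained: once the user accepts a modifier, its identity conditions the head ranking, mirroring the idea of predicting a compound as the concatenation of its components rather than as a single lexicon entry.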