Tools for automatic lexicon maintenance: acquisition, error correction, and the generation of missing values

This paper describes the algorithmic methods used in a German monolingual lexicon project dealing with a multi-million-entry lexicon. We describe the usefulness of several kinds of information that can be extracted from the lexicon: for German nouns and adjectives, candidates for their inflection classes are detected automatically, and forms that do not fit any of these classes are good error candidates. An n-gram model is used to find unusual letter combinations, which also indicate errors or foreign-language entries. Regularity is exploited, especially for compounds, to obtain inflection information. In all algorithms, frequency information is used to select terms for correction. Quality information is attached to every entry; generating and using this quality information gives automatic control over both the data and the correctness of the algorithms. The algorithms are designed to be language-independent; language-specific data (such as inflection classes and n-grams) should be available or relatively easy to obtain.
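The character n-gram approach to error detection mentioned above can be illustrated with a minimal sketch: train n-gram counts on a trusted word list, score each candidate word by the average log-probability of its character n-grams, and flag words whose score falls below a threshold as error candidates or foreign-language entries. The word lists, the trigram order, and the threshold here are illustrative assumptions, not values from the paper.

```python
import math
from collections import Counter

def char_ngrams(word, n=3):
    """Yield character n-grams of a word padded with boundary markers."""
    padded = f"^{word}$"
    for i in range(len(padded) - n + 1):
        yield padded[i:i + n]

def train_model(words, n=3):
    """Count character n-gram frequencies over a (trusted) word list."""
    counts = Counter()
    for w in words:
        counts.update(char_ngrams(w.lower(), n))
    return counts, sum(counts.values())

def score(word, counts, total, n=3):
    """Average log-probability of the word's n-grams,
    with add-one smoothing so unseen n-grams get a small probability."""
    grams = list(char_ngrams(word.lower(), n))
    logp = sum(math.log((counts[g] + 1) / (total + 1)) for g in grams)
    return logp / len(grams)

# Toy lexicon: words with letter sequences unseen in the training
# data score low and are flagged for manual review.
lexicon = ["haus", "maus", "baum", "traum", "raum", "laus"]
counts, total = train_model(lexicon)
threshold = -3.0  # assumed cutoff; tuned per lexicon in practice
flagged = [w for w in ["haut", "xqzv"] if score(w, counts, total) < threshold]
# flagged == ["xqzv"] on this toy data
```

In a production lexicon the threshold would typically be chosen from the score distribution of entries already marked as high-quality, combining the n-gram score with the frequency information described above.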