Tools for automatic lexicon maintenance: acquisition, error correction, and the generation of missing values
The paper describes the algorithmic methods used in a German monolingual lexicon project dealing with a multimillion-entry lexicon. We describe the usefulness of different kinds of information that can be extracted from the lexicon: for German nouns and adjectives, candidates for their inflection classes are detected automatically. Forms that do not fit into these classes are good error candidates. An n-gram model is used to find unusual combinations of letters, which also indicate errors or foreign-language entries. Regularities, especially in compounds, are exploited to derive inflection information. In all algorithms, frequency information is used to select terms for correction. Quality information is attached to all entries; generating and using this quality information gives automatic control over both the data and the correctness of the algorithms. The algorithms are designed to be language independent. Language-specific data (such as inflection classes and n-grams) should be available or relatively easy to obtain.
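The character n-gram approach to spotting unusual letter combinations can be illustrated with a minimal sketch. The function names, the trigram order, and the toy word list below are illustrative assumptions, not the project's actual implementation: the idea is simply that words whose character n-grams are rare in the lexicon score low and become error or foreign-word candidates.

```python
import math
from collections import Counter

def train_ngrams(words, n=3):
    """Count character n-grams (with boundary markers) over a word list."""
    counts = Counter()
    for w in words:
        padded = "^" + w.lower() + "$"
        for i in range(len(padded) - n + 1):
            counts[padded[i:i + n]] += 1
    return counts

def score(word, counts, n=3):
    """Average log-frequency of the word's n-grams; a low score marks an
    unusual letter combination (error or foreign-entry candidate)."""
    padded = "^" + word.lower() + "$"
    grams = [padded[i:i + n] for i in range(len(padded) - n + 1)]
    return sum(math.log(counts[g] + 1) for g in grams) / len(grams)

# Hypothetical toy lexicon standing in for the multimillion-entry data.
lexicon = ["haus", "maus", "laus", "baum", "traum", "raum", "saum"]
counts = train_ngrams(lexicon)

# A German-like form scores higher than an implausible letter sequence.
print(score("haus", counts) > score("xqzv", counts))
```

In practice, such scores would be combined with the frequency information mentioned above, so that only low-scoring, low-frequency forms are queued for manual correction.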