Bootstrapping a Database of German Multi-word Expressions

We pre-classified 32,000 entries from the Wörterbuch der deutschen Idiomatik (Schemann, 1993) using an inductive description of POS sequences in conjunction with a Brill tagger trained on manually tagged idiomatic entries. This process assigned categories to 86% of entries with 88% accuracy. Further manual classification resulted in a database of multi-word expressions in which each entry is associated with a sequence of POS-tag/token pairs. The second phase of our project, currently underway, associates each sequence of POS-tag/token pairs with a corpus example. To this end, we generate a weighted finite-state transducer from the sequences for each entry and apply a finite-state filter to the corpus. The filter will extract those sequences in the corpus that correspond to the longest match of the multi-word expression.
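The longest-match filtering step described above can be illustrated with a minimal sketch. It is not the weighted finite-state transducer the abstract refers to: it is a plain left-to-right matcher over a tagged sentence, and it makes several assumptions. The STTS-style tags, the two example entries, the wildcard convention (`None` for "any token with this tag"), and the function names `matches_at` and `longest_matches` are all hypothetical and are not taken from the database itself.

```python
from typing import Dict, List, Optional, Tuple

# One database entry: a sequence of (POS tag, token) pairs.
# Assumption: token None means "any token carrying this tag".
Pair = Tuple[str, Optional[str]]
# One corpus token: (token, POS tag), as produced by some tagger.
Tagged = Tuple[str, str]


def matches_at(pattern: List[Pair], sentence: List[Tagged], start: int) -> bool:
    """Return True if the whole pattern matches the sentence at position start."""
    if start + len(pattern) > len(sentence):
        return False
    for (tag, token), (word, word_tag) in zip(pattern, sentence[start:]):
        if tag != word_tag:
            return False
        if token is not None and token.lower() != word.lower():
            return False
    return True


def longest_matches(
    patterns: Dict[str, List[Pair]], sentence: List[Tagged]
) -> List[Tuple[str, int, int]]:
    """Scan left to right; at each position keep only the longest matching entry."""
    hits = []
    i = 0
    while i < len(sentence):
        best = None
        for name, pattern in patterns.items():
            if matches_at(pattern, sentence, i):
                if best is None or len(pattern) > len(best[1]):
                    best = (name, pattern)
        if best is None:
            i += 1
        else:
            hits.append((best[0], i, i + len(best[1])))
            i += len(best[1])  # skip past the match so shorter entries do not overlap it
    return hits


# Hypothetical entries and a hypothetical tagged sentence (STTS-style tags assumed).
PATTERNS = {
    "ins Gras": [("APPRART", "ins"), ("NN", "Gras")],
    "ins Gras beißen": [("APPRART", "ins"), ("NN", "Gras"), ("VVINF", "beißen")],
}
SENTENCE = [
    ("Er", "PPER"), ("musste", "VMFIN"), ("ins", "APPRART"),
    ("Gras", "NN"), ("beißen", "VVINF"), (".", "$."),
]

print(longest_matches(PATTERNS, SENTENCE))
```

Both entries match at position 2 of the sample sentence, and the matcher keeps only the three-token one. A real finite-state filter would additionally handle weights, intervening material, and German word-order variation, none of which this sketch attempts.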

Alexander Geyken

[1] Gregory Grefenstette, et al. Regular expressions for language engineering, 1996, Natural Language Engineering.

[2] Jean Senellart. Reconnaissance automatique des entrées du lexique-grammaire des phrases figées, 1998.

[3] Jordan Boyd-Graber, et al. Automatic classification of multi-word expressions in print dictionaries, 2004.

[4] Frank Smadja, et al. Retrieving Collocations from Text: Xtract, 1993, CL.

[5] Hans Schemann. Deutsche Idiomatik: die deutschen Redewendungen im Kontext, 1993.

[6] Eric Brill, et al. Some Advances in Transformation-Based Part of Speech Tagging, 1994, AAAI.

[7] Michael Oakes, et al. Statistics for Corpus Linguistics, 1998.