We pre-classified 32,000 entries from the Wörterbuch der deutschen Idiomatik (Schemann, 1993) using an inductive description of POS sequences in conjunction with a Brill tagger trained on manually tagged idiomatic entries. This process assigned categories to 86% of entries with 88% accuracy. Further manual classification produced a database of multi-word expressions in which each entry is associated with a sequence of POS-tag/token pairs. The second phase of our project, currently underway, associates each sequence of POS-tag/token pairs with a corpus example. To this end, we generate a weighted finite-state transducer from the sequences for each entry and apply a finite-state filter to the corpus. The filter will extract those sequences in the corpus that correspond to the longest match of the multi-word expression.
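The longest-match filtering step can be illustrated with a minimal sketch. It is not the weighted finite-state transducer described above: it is a plain left-to-right matcher over POS-tag/token pairs, and it only shows how a longer entry takes precedence over a shorter one that matches at the same corpus position. The STTS-style tags, the two dictionary entries, the example sentence, and the names `matches` and `longest_matches` are all assumptions made for illustration.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical representation: a pattern item is (POS tag, token),
# where token=None would stand for an open slot that matches any word.
PatternItem = Tuple[str, Optional[str]]
TaggedToken = Tuple[str, str]  # (POS tag, token) from a tagged corpus


def matches(item: PatternItem, tagged: TaggedToken) -> bool:
    """A pattern item matches if the POS tags agree and the token,
    when one is specified, is equal ignoring case."""
    pos, token = item
    corpus_pos, corpus_token = tagged
    return pos == corpus_pos and (token is None or token.lower() == corpus_token.lower())


def longest_matches(
    entries: Dict[str, List[PatternItem]],
    corpus: List[TaggedToken],
) -> List[Tuple[str, int, int]]:
    """Scan the corpus left to right; at each position keep the longest
    entry that matches and skip past it. Returns (entry, start, end)."""
    hits = []
    i = 0
    while i < len(corpus):
        best = None  # (entry name, pattern length)
        for name, pattern in entries.items():
            n = len(pattern)
            window = corpus[i:i + n]
            if len(window) == n and all(matches(p, c) for p, c in zip(pattern, window)):
                if best is None or n > best[1]:
                    best = (name, n)
        if best is not None:
            hits.append((best[0], i, i + best[1]))
            i += best[1]
        else:
            i += 1
    return hits


# Illustrative entries and sentence (assumed, not taken from the dictionary data).
ENTRIES = {
    "ins Gras": [("APPRART", "ins"), ("NN", "Gras")],
    "ins Gras beißen": [("APPRART", "ins"), ("NN", "Gras"), ("VVINF", "beißen")],
}
CORPUS = [
    ("PPER", "Er"), ("VMFIN", "musste"), ("APPRART", "ins"),
    ("NN", "Gras"), ("VVINF", "beißen"), ("$.", "."),
]

print(longest_matches(ENTRIES, CORPUS))  # [('ins Gras beißen', 2, 5)]
```

A real finite-state implementation would compile all entries into a single automaton rather than testing each pattern at every position, and the weights could then be used to rank competing analyses. The sketch only makes the longest-match criterion concrete.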
[1] Gregory Grefenstette et al. Regular expressions for language engineering. Natural Language Engineering, 1996.
[2] Jean Senellart. Reconnaissance automatique des entrées du lexique-grammaire des phrases figées. 1998.
[3] Jordan Boyd-Graber et al. Automatic classification of multi-word expressions in print dictionaries. 2004.
[4] Frank Smadja et al. Retrieving Collocations from Text: Xtract. CL, 1993.
[5] Hans Schemann. Deutsche Idiomatik: die deutschen Redewendungen im Kontext. 1993.
[6] Eric Brill et al. Some Advances in Transformation-Based Part of Speech Tagging. AAAI, 1994.
[7] Michael Oakes et al. Statistics for Corpus Linguistics. 1998.