Tools for Collocation Extraction: Preferences for Active vs. Passive

We present and partially evaluate procedures for extracting noun+verb collocation candidates from German text corpora, together with their morphosyntactic preferences, in particular for active vs. passive voice. We start from tokenized, tagged, lemmatized and chunked text and use extraction patterns formulated in the CQP corpus query language. We discuss the results of a precision evaluation on administrative texts from the European Union: we find a considerable number of specialized collocations, as well as general ones and complex predicates; overall, the precision is considerably higher than that of a statistical extractor used as a baseline.
