This paper deals with the automatic identification and extraction of idiomatic expressions from parallel corpora. Idiomatic expressions, a subset of multiword expressions (Sag et al., 2001), henceforth called MWEs, are lexical items consisting of multiple simplex words. They are generally not fully compositional and are therefore difficult to analyse and process in natural language processing applications. Over the past two decades, there has been growing interest in the automatic identification and extraction of idiomatic expressions and other kinds of MWEs. Among the numerous approaches to extracting these expressions from text automatically, those that exploit parallel corpora have been shown to deliver satisfactory results. In this work, we use statistical association measures to extract idiomatic expressions and improve the resulting ranking with the alignment information provided by parallel corpora. This approach is based on the work of Villada Moiron and Tiedemann (2006). In contrast to their approach, which was applied to Dutch, we perform our experiments on English. In addition, we expand the set of MWE candidates by adding structures other than VERB PP to the set of extracted idioms, and we test the method on a different corpus, the OpenSubtitles2012 dataset, a collection of TV series and movie subtitles compiled from the website OpenSubtitles (http://www.opensubtitles.org/).
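To illustrate what ranking candidates by a statistical association measure can look like, the following is a minimal sketch that uses pointwise mutual information (PMI). PMI is chosen here only as one common example; the abstract does not state which measures are used. The function name `pmi_ranking` and the toy candidate pairs are assumptions made for illustration, and the re-ranking step based on alignment information is not shown.

```python
import math
from collections import Counter


def pmi_ranking(pairs):
    """Rank candidate word pairs by pointwise mutual information (PMI).

    `pairs` is a list of (left, right) tuples, one per observed occurrence.
    This is an illustrative sketch, not the method of the paper.
    """
    pair_counts = Counter(pairs)
    left_counts = Counter(left for left, _ in pairs)
    right_counts = Counter(right for _, right in pairs)
    total = len(pairs)

    scores = {}
    for (left, right), joint in pair_counts.items():
        # PMI = log2( P(left, right) / (P(left) * P(right)) )
        scores[(left, right)] = math.log2(
            joint * total / (left_counts[left] * right_counts[right])
        )
    # Highest association first; ties keep their first-seen order.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy occurrences (hypothetical data, not drawn from OpenSubtitles2012).
candidates = [
    ("kick", "bucket"), ("kick", "bucket"), ("kick", "ball"),
    ("take", "ball"), ("take", "place"), ("take", "place"),
]
ranking = pmi_ranking(candidates)
for pair, score in ranking:
    print(pair, round(score, 3))
```

In this toy sample, pairs that co-occur more often than their individual frequencies would predict receive a higher score and move to the top of the list.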
[1] Jörg Tiedemann et al. Parallel Data, Tools and Interfaces in OPUS. LREC, 2012.
[2] Miriam Butt. The light verb jungle: still hacking away. 2010.
[3] Martha W. Evens et al. Lexicography and Natural Language Processing: A Festschrift in Honour of B. T. S. Atkins. 2003.
[4] I. D. Melamed. Measuring Semantic Entropy. 1997.
[5] Joakim Nivre et al. MaltParser: A Data-Driven Parser-Generator for Dependency Parsing. LREC, 2006.
[6] Helmut Schmidt et al. Probabilistic part-of-speech tagging using decision trees. 1994.
[7] Stefan Evert et al. The Statistics of Word Cooccurrences: Word Pairs and Collocations. 2004.
[8] Timothy Baldwin et al. Multiword Expressions: A Pain in the Neck for NLP. CICLing, 2002.
[9] Jörg Tiedemann et al. Identifying idiomatic expressions using automatic word-alignment. 2006.
[10] Philipp Koehn et al. Europarl: A Parallel Corpus for Statistical Machine Translation. MTSUMMIT, 2005.
[11] Paola Merlo et al. Automatic distinction of arguments and modifiers: the case of prepositional phrases. CoNLL, 2001.