Identification of Idiomatic Expressions Using Parallel Subtitle Corpora

This paper deals with the automatic identification and extraction of idiomatic expressions from parallel corpora. Idiomatic expressions, a subset of multiword expressions (MWEs; Sag et al., 2001), are lexical items consisting of multiple simplex words that are generally not fully compositional and are therefore difficult to analyse and process in natural language processing applications. Over the past two decades, interest in the automatic identification and extraction of idiomatic expressions and other kinds of MWEs has grown. Among the numerous approaches to extracting these expressions automatically from text, those that use parallel corpora have been shown to deliver satisfactory results. In this work, we use statistical association measures to extract idiomatic expressions and then improve the resulting ranking using the alignment information provided by parallel corpora. This approach is based on the work of Villada Moiron and Tiedemann (2006). In contrast to their work, which was carried out on Dutch, we perform our experiments on English. In addition, we expand the set of MWE candidates by adding structures other than VERB PP, and we test the method on a different corpus, the OpenSubtitles2012 dataset, a collection of TV series and movie subtitles compiled from the website OpenSubtitles (http://www.opensubtitles.org/).
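To make the idea of ranking candidates by a statistical association measure concrete, the following is a minimal sketch that scores adjacent word pairs by pointwise mutual information (PMI). PMI is chosen here purely for illustration, since the text above does not name the measures used. The function name `pmi_ranking`, the `min_count` threshold, and the toy sentences are all hypothetical, and the alignment-based re-ranking step is not shown.

```python
import math
from collections import Counter


def pmi_ranking(sentences, min_count=2):
    """Rank adjacent word pairs by PMI (illustrative measure, an assumption)."""
    unigrams = Counter()
    bigrams = Counter()
    for tokens in sentences:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))

    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())

    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            # Skip rare pairs: PMI is unreliable for low counts.
            continue
        p_pair = count / n_bi
        p_w1 = unigrams[w1] / n_uni
        p_w2 = unigrams[w2] / n_uni
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))

    # Highest association first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy, made-up input (not taken from the OpenSubtitles2012 data).
sentences = [
    ["he", "kicked", "the", "bucket"],
    ["she", "kicked", "the", "bucket"],
    ["he", "saw", "the", "dog"],
    ["she", "saw", "the", "cat"],
]
ranking = pmi_ranking(sentences)
for pair, score in ranking:
    print(pair, round(score, 3))
```

In a setup like the one described above, a ranked list of this kind would only be the starting point; the alignment information from the parallel corpus would then be used to reorder it.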