POS-Patterns or Syntax? Comparing Methods for Extracting Word Combinations

This paper reports on work carried out in the framework of an ongoing project aimed at building an online, corpus-based lexicographic resource for Italian Word Combinations. Our aim is to compare two of the most commonly used methods for the automatic extraction of word combinations from corpora, with a view to evaluating their performance, and ultimately their efficacy, with respect to the task of acquiring word combinations for inclusion in the lexicographic combinatory resource.

1. WORD COMBINATIONS: LEXICOGRAPHY AND NLP

It is widely acknowledged that lexicographers’ introspection alone cannot provide comprehensive information about word meaning and usage, and that investigation of language in use is fundamental for any reliable lexicographic work (Atkins and Rundell 2008). This is even more true for dictionaries that record the combinatorial behaviour of words, where the lexicographic task is to detect the typical combinations a word participates in. In fact, it was much harder to study lexical combinatorics empirically before the advent of large corpora and the definition of statistical techniques for the analysis of word associations (Hanks 2012).

This paper reports on work carried out in the framework of an ongoing project called CombiNet, aimed at building an online, corpus-based lexicographic resource for Italian Word Combinations. We use the term Word Combinations (WoCs) to encompass both Multiword Expressions (MWEs), namely WoCs characterised by different degrees of fixedness and idiomaticity that act as a single unit at some level of linguistic analysis, such as idioms, phrasal lexemes, collocations, preferred combinations (Calzolari et al. 2002, Sag et al.

PRIN Project 2010-2011 Word Combinations in Italian (n. 20105B3HE8), funded by the Italian Ministry of Education, University and Research (MIUR). URL: http://combinet.humnet.unipi.it.

[1] S. Gries. Phraseology and linguistic theory: a brief survey, 2007.

[2] Malvina Nissim, et al. Mapping the constructicon with SYMPAThy. Italian word combinations between fixedness and productivity, 2015, NetWordS.

[3] Ralph Grishman, et al. Towards Best Practice for Multiword Expressions in Computational Lexicons, 2002, LREC.

[4] Alessandro Lenci, et al. LexIt: A Computational Resource on Italian Argument Structure, 2012, LREC.

[5] Violeta Seretan. Syntax-Based Collocation Extraction, 2010.

[6] Alessandro Lenci, et al. Extracting Terms with EXTra, 2016.

[7] R. Lew. The Oxford Guide to Practical Lexicography, 2009.

[8] Malvina Nissim, et al. SYMPAThy: Towards a comprehensive approach to the extraction of Italian Word Combinations, 2014.

[9] Valentina Piunno, et al. Studio comparativo dei dizionari combinatori dell'italiano e di altre lingue europee, 2013.

[10] Timothy Baldwin, et al. Multiword Expressions: A Pain in the Neck for NLP, 2002, CICLing.

[11] Merrill D. Benson, et al. The BBI Combinatory Dictionary of English, 1989.

[12] Carlos Ramisch, et al. Validation and Evaluation of Automatically Acquired Multiword Expressions for Grammar Engineering, 2007, EMNLP.

[13] Sara Castagnoli. Extracting MWEs from Italian corpora: A case study for refining the POS-pattern methodology, 2014, MWE@EACL.

[14] Carlos Ramisch, et al. mwetoolkit: a Framework for Multiword Expression Identification, 2010, LREC.

[15] Patrick Hanks. Corpus Evidence and Electronic Lexicography, 2012.

[16] Guy Aston, et al. Introducing the La Repubblica Corpus: A Large, Annotated, TEI(XML)-compliant Corpus of Newspaper Italian, 2004, LREC.

[17] Vincenzo Lo Cascio. Dizionario Combinatorio Italiano, 2013.

[18] S. Gries. Phraseology and linguistic theory: A brief survey, 2008.