Language Resources for Italian: towards the Development of a Corpus of Annotated Italian Multiword Expressions

This paper describes the first resource annotated for multiword expressions (MWEs) in Italian. Two versions of the dataset have been prepared: the first is a quickly marked-up list of out-of-context MWEs, and the second is an in-context annotation in which each MWE is recorded together with its context. The paper also discusses annotation issues and reports the inter-annotator agreement for both types of annotation. Finally, it presents the results of the first use of the new resource, namely the automatic extraction of Italian MWEs.
