Arabic Multiword Expressions

In this work we address the problem of automatic multiword expression (MWE) identification and classification in Arabic running text. We propose a supervised machine learning approach that uses a relatively small manually annotated data set, augmented with increasing amounts of automatically tagged data labeled by a deterministic pattern-matching algorithm. In particular, in this chapter we show the impact of explicitly modeling morpho-syntactic features on the detection task. Moreover, we present the first work to address the problem of handling gapped verb-noun constructions in running text. We show that using the syntactic construction classes as labels improves identification results for verb-noun and verb-particle constructions. Our best identification algorithm yields an F-measure of 61.4%, a significant improvement over our baseline of 48.8%.
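
The abstract does not spell out the deterministic pattern-matching step, so the following is only a minimal sketch of how such a tagger might label running text, including gapped constructions. The lexicon format, the `max_gap` parameter, the B-/I-/O tagging scheme, the class names `VNC` and `VPC`, and the English placeholder tokens are all assumptions made for illustration; none of them are taken from the original work.

```python
from typing import Dict, List, Sequence, Tuple


def tag_mwes(
    tokens: Sequence[str],
    lexicon: Dict[Tuple[str, ...], str],
    max_gap: int = 2,
) -> List[str]:
    """Greedy left-to-right matcher (hypothetical helper).

    Each lexicon key is a tuple of lemmas and each value is a construction
    class. Up to `max_gap` intervening tokens are allowed between two
    consecutive components of an expression.
    """
    tags = ["O"] * len(tokens)
    for i, tok in enumerate(tokens):
        if tags[i] != "O":
            continue  # token already belongs to an earlier match
        for pattern, mwe_class in lexicon.items():
            if pattern[0] != tok:
                continue
            positions = [i]
            start = i + 1
            for part in pattern[1:]:
                # Look for the next component inside the allowed gap window.
                window = range(start, min(len(tokens), start + max_gap + 1))
                found = next(
                    (k for k in window if tags[k] == "O" and tokens[k] == part),
                    None,
                )
                if found is None:
                    break
                positions.append(found)
                start = found + 1
            else:
                # Every component was found: mark the first as B-, the rest as I-.
                tags[positions[0]] = "B-" + mwe_class
                for p in positions[1:]:
                    tags[p] = "I-" + mwe_class
                break  # stop trying other patterns for this start token
    return tags


# Illustrative input only: lemmatized placeholder tokens and a toy lexicon.
toy_lexicon = {("take", "look"): "VNC", ("give", "up"): "VPC"}
toy_tokens = ["she", "take", "a", "long", "look", "and", "give", "up"]
toy_tags = tag_mwes(toy_tokens, toy_lexicon)
```

In this sketch the gapped verb-noun match skips the two intervening tokens and leaves them tagged `O`, while the construction class is carried in the label itself, which mirrors the idea of using syntactic construction classes as labels.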
