English–Arabic collocation extraction to enhance Arabic collocation identification

Bilingual collocation extraction can improve the performance of monolingual extraction. This is especially true for the English–Arabic pair, where bilingual evidence helps overcome the difficulties of Arabic collocation extraction. In this paper, we present two novel approaches, one for extracting monolingual collocations and one for extracting bilingual collocations. The monolingual approach is hybrid, based on linguistic patterns and statistical measures. During statistical filtering, we propose combining vector-based measures with different association measures via a voting procedure. The bilingual extraction capitalizes on several cues (position, frequency, cross-language correspondence between POS patterns, distribution, and translation). It enhances monolingual collocation extraction by considering more than collocation equivalents with a direct translation: it can also validate unconfirmed collocations because they translate confirmed ones. The results show, in particular, that extracting English–Arabic collocations improves the extraction of Arabic collocations: the precision of Arabic collocation extraction rose from about 86% to 96%.
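
As an illustration of what a voting procedure over association measures might look like, the following minimal Python sketch lets three standard measures (pointwise mutual information, Dice, and t-score) each cast a vote on a candidate word pair. The choice of measures, the thresholds, the majority rule, and every function name here are assumptions made for illustration; the abstract does not specify them, and the vector-based measures it mentions are left out.

```python
import math

# Each measure takes: f_xy (pair frequency), f_x and f_y (word frequencies),
# and n (corpus size in tokens). All are assumed to be positive counts.

def pmi(f_xy, f_x, f_y, n):
    """Pointwise mutual information (log base 2)."""
    return math.log2((f_xy * n) / (f_x * f_y))

def dice(f_xy, f_x, f_y, n):
    """Dice coefficient; n is unused but kept for a uniform signature."""
    return 2 * f_xy / (f_x + f_y)

def t_score(f_xy, f_x, f_y, n):
    """t-score: observed minus expected frequency, scaled by sqrt(observed)."""
    expected = f_x * f_y / n
    return (f_xy - expected) / math.sqrt(f_xy)

# Hypothetical thresholds, chosen only for this example.
MEASURES = {
    "pmi": (pmi, 3.0),
    "dice": (dice, 0.1),
    "t_score": (t_score, 2.0),
}

def count_votes(f_xy, f_x, f_y, n):
    """Number of measures whose score reaches its threshold."""
    return sum(
        1 for measure, threshold in MEASURES.values()
        if measure(f_xy, f_x, f_y, n) >= threshold
    )

def accept_candidate(f_xy, f_x, f_y, n, min_votes=2):
    """Keep a candidate if at least min_votes measures vote for it."""
    return count_votes(f_xy, f_x, f_y, n) >= min_votes
```

For example, `accept_candidate(30, 40, 50, 100_000)` keeps a pair whose words nearly always occur together, while `accept_candidate(2, 5000, 4000, 100_000)` rejects a pair of frequent words that rarely co-occur. A vector-based similarity score could be added as one more voter in the same dictionary.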
