Detection of Verbal Multi-Word Expressions via Conditional Random Fields with Syntactic Dependency Features and Semantic Re-Ranking

This paper describes a system for identifying Verbal Multi-Word Expressions (VMWEs) in running text. The system relies mainly on universal syntactic dependency features, which it exploits through a Conditional Random Fields (CRF) sequence model. It competed in the Closed Track of the PARSEME VMWE Shared Task 2017, ranking 2nd in most languages on full VMWE-based evaluation and 1st in three languages on token-based evaluation. In addition, the paper presents an option to re-rank the 10 best CRF-predicted sequences using semantic vectors, which raises the system's scores above those of the other systems in the competition. We also show that all systems in the competition would struggle to beat a simple lookup baseline, and we argue for a more purpose-specific evaluation scheme.
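To make the idea of a lookup baseline concrete, the following is a minimal sketch and not the baseline used in the paper, whose details the abstract does not give. It assumes that each training sentence is a list of lemmas paired with the token positions of its annotated VMWEs, and that a test sentence is tagged whenever all lemmas of a previously seen expression occur in it. The names `build_lexicon` and `lookup_tag` are hypothetical.

```python
from typing import FrozenSet, List, Set, Tuple

# Assumed data layout (not from the paper): a sentence is a list of lemmas,
# and each annotated VMWE is the set of token positions it covers.
Sentence = Tuple[List[str], List[Set[int]]]


def build_lexicon(training: List[Sentence]) -> Set[FrozenSet[str]]:
    """Collect the lemma set of every VMWE annotated in the training data."""
    lexicon: Set[FrozenSet[str]] = set()
    for lemmas, vmwes in training:
        for positions in vmwes:
            lexicon.add(frozenset(lemmas[i] for i in positions))
    return lexicon


def lookup_tag(lemmas: List[str], lexicon: Set[FrozenSet[str]]) -> List[Set[int]]:
    """Flag a VMWE whenever all lemmas of a known expression occur in the sentence."""
    found: List[Set[int]] = []
    # Sort the entries so the output order is deterministic.
    for entry in sorted(lexicon, key=sorted):
        if all(lemma in lemmas for lemma in entry):
            # Simplification: use the first occurrence of each lemma.
            found.append({lemmas.index(lemma) for lemma in entry})
    return found


training = [
    (["he", "take", "a", "decision"], [{1, 3}]),
    (["she", "give", "up"], [{1, 2}]),
]
lexicon = build_lexicon(training)
matches = lookup_tag(["they", "take", "the", "final", "decision"], lexicon)
no_matches = lookup_tag(["we", "take", "a", "walk"], lexicon)
```

Because such a baseline only memorises expressions seen in training, it cannot detect unseen VMWEs and ignores word order and distance between components; it is meant only to illustrate why a simple lookup can be a meaningful point of comparison.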
