Detecting Multiword Expression Type Helps Lexical Complexity Assessment

Multiword expressions (MWEs) are lexemes that should be treated as single lexical units because of their idiosyncratic nature. Many NLP applications have been shown to benefit from MWE identification; however, the lexical complexity of MWEs remains under-explored. In this work, we re-annotate the Complex Word Identification Shared Task 2018 dataset of Yimam et al. (2017), which provides complexity scores for a range of lexemes, with MWE types. We release the MWE-annotated dataset with this paper, and we believe it is a valuable resource for the text simplification community. In addition, we investigate which types of expressions are most problematic for native and non-native readers. Finally, we show that a lexical complexity assessment system benefits from information about MWE types.

[1]  Timothy Baldwin, et al. Multiword Expressions: A Pain in the Neck for NLP, 2002, CICLing.

[2]  Marine Carpuat, et al. Task-based Evaluation of Multiword Expressions: a Pilot Study in Statistical Machine Translation, 2010, NAACL.

[3]  Ekaterina Kochmar, et al. Complex Word Identification as a Sequence Labelling Task, 2019, ACL.

[4]  Alexander F. Gelbukh, et al. Complex Word Identification: Convolutional Neural Network vs. Feature Engineering, 2018, BEA@NAACL-HLT.

[5]  B. Everitt, et al. Statistical methods for rates and proportions, 1973.

[6]  Stephen G. Pulman, et al. An Unsupervised Ranking Model for Noun-Noun Compositionality, 2012, *SEM@NAACL-HLT.

[7]  Carlos Ramisch, et al. Multiword Expression Processing: A Survey, 2017, CL.

[8]  Matthew Shardlow, et al. A Comparison of Techniques to Automatically Identify Complex Words, 2013, ACL.

[9]  Mark Davies. The 385+ million word Corpus of Contemporary American English (1990–2008+): Design, architecture, and linguistic insights, 2009.

[10]  Christian Biemann, et al. CWIG3G2 - Complex Word Identification Task across Three Text Genres and Two User Groups, 2017, IJCNLP.

[11]  Matthew Shardlow, et al. Neural Text Simplification of Clinical Letters with a Domain Specific Phrase Table, 2019, ACL.

[12]  Maja Popovic. Complex Word Identification Using Character n-grams, 2018, BEA@NAACL-HLT.

[13]  Ekaterina Kochmar, et al. CAMB at CWI Shared Task 2018: Complex Word Identification with Ensemble-Based Voting, 2018, BEA@NAACL-HLT.

[14]  Dieter Kastovsky, et al. English word-formation, 1986.

[15]  J. R. Landis, et al. The measurement of observer agreement for categorical data, 1977, Biometrics.

[16]  Xiaojun Wan, et al. Automatic Text Simplification, 2018, Computational Linguistics.

[17]  Noah A. Smith, et al. Comprehensive Annotation of Multiword Expressions in a Social Web Corpus, 2014, LREC.

[18]  Lucia Specia, et al. SemEval 2016 Task 11: Complex Word Identification, 2016, *SEMEVAL.

[19]  Timothy Baldwin, et al. Bayesian Text Segmentation for Index Term Identification and Keyphrase Extraction, 2012, COLING.

[20]  Wei Xu, et al. A Word-Complexity Lexicon and A Neural Readability Ranking Model for Lexical Simplification, 2018, EMNLP.

[21]  Lucia Specia, et al. A Report on the Complex Word Identification Shared Task 2018, 2018, BEA@NAACL-HLT.

[22]  Lucia Specia, et al. SV000gg at SemEval-2016 Task 11: Heavy Gauge Complex Word Identification with System Voting, 2016, SemEval@NAACL-HLT.

[23]  Ray Jackendoff, et al. The Architecture of the Language Faculty, 1996.

[24]  N. Ellis, et al. Formulaic Language in Native and Second Language Speakers: Psycholinguistics, Corpus Linguistics, and TESOL, 2008.

[25]  Gustavo Paetzold. Reliable Lexical Simplification for Non-Native Speakers, 2015, HLT-NAACL.

[26]  Yulia Clausen, et al. Metaphors in Text Simplification: To change or not to change, that is the question, 2019, BEA@ACL.