PolyU CBS-Comp at SemEval-2021 Task 1: Lexical Complexity Prediction (LCP)

In this contribution, we describe the system presented by the PolyU CBS-Comp Team at Task 1 of SemEval 2021, whose goal was to estimate the complexity of words in a given sentence context. Our best system combines lexical, syntactic, word-embedding and Transformer-derived features with a Gradient Boosting Regressor. It achieves a correlation score of 0.754 on Subtask 1 (single words) and 0.659 on Subtask 2 (multiword expressions).
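As a rough illustration of the general approach (hand-crafted features fed to a gradient boosting regressor and scored by correlation), the following is a minimal, dependency-free sketch. It is not the authors' implementation: the two features (`word_length`, `log_frequency`), the synthetic complexity scores, the use of regression stumps as weak learners, and all function names are assumptions made for this example. Pearson correlation is also an assumption here, since the abstract only speaks of a "correlation score".

```python
import math
import random


def fit_stump(X, residuals):
    """Return the one-feature threshold split (feature, threshold, left, right)
    that minimises squared error on the residuals."""
    best = None
    for j in range(len(X[0])):
        values = sorted(set(row[j] for row in X))
        # Candidate thresholds: midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [r for row, r in zip(X, residuals) if row[j] <= t]
            right = [r for row, r in zip(X, residuals) if row[j] > t]
            lm, rm = sum(left) / len(left), sum(right) / len(right)
            err = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
            if best is None or err < best[0]:
                best = (err, j, t, lm, rm)
    if best is None:  # every feature is constant: fall back to the mean
        mean = sum(residuals) / len(residuals)
        return 0, math.inf, mean, mean
    return best[1:]


def fit_gbr(X, y, n_rounds=60, lr=0.1):
    """Gradient boosting for squared loss: each stump fits the current residuals."""
    base = sum(y) / len(y)
    preds = [base] * len(y)
    stumps = []
    for _ in range(n_rounds):
        residuals = [yi - pi for yi, pi in zip(y, preds)]
        j, t, lv, rv = fit_stump(X, residuals)
        stumps.append((j, t, lv, rv))
        preds = [p + lr * (lv if row[j] <= t else rv) for row, p in zip(X, preds)]
    return base, lr, stumps


def predict(model, X):
    base, lr, stumps = model
    return [base + lr * sum(lv if row[j] <= t else rv for j, t, lv, rv in stumps)
            for row in X]


def pearson(a, b):
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def make_data(n, rng):
    """Synthetic stand-in data: longer and rarer words get higher complexity."""
    X, y = [], []
    for _ in range(n):
        word_length = rng.randint(2, 14)      # hypothetical lexical feature
        log_frequency = rng.uniform(0.0, 6.0)  # hypothetical frequency feature
        score = 0.3 + 0.03 * word_length - 0.05 * log_frequency + rng.gauss(0, 0.03)
        X.append([word_length, log_frequency])
        y.append(min(1.0, max(0.0, score)))
    return X, y


rng = random.Random(0)
X_train, y_train = make_data(120, rng)
X_test, y_test = make_data(60, rng)

model = fit_gbr(X_train, y_train)
score = pearson(predict(model, X_test), y_test)
print(f"Pearson correlation on held-out synthetic data: {score:.3f}")
```

In a real pipeline, syntactic, embedding-based and Transformer-derived features would simply be appended as extra columns of each feature row, and a library regressor would normally replace the toy stump learner used here.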
