ANDI at SemEval-2021 Task 1: Predicting complexity in context using distributional models, behavioural norms, and lexical resources

This paper describes our participation in the Lexical Complexity Prediction (LCP) shared task of SemEval-2021, which involved predicting subjective complexity ratings for English single words and multi-word expressions presented in context. Our approach combines distributional models, both context-dependent and context-independent, with behavioural norms and lexical resources.
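
As a rough illustration of what combining these feature families can look like, the sketch below concatenates three blocks of per-item features and fits a single regressor on the result. It is not the system described in this paper: the feature dimensions, the random toy data, the choice of ridge regression, the assumption that ratings are scaled to [0, 1], and the helper names `combine_features`, `fit_ridge` and `predict` are all hypothetical.

```python
import numpy as np

def combine_features(contextual, static, norms):
    """Concatenate per-item feature blocks column-wise into one design matrix."""
    return np.hstack([contextual, static, norms])

def fit_ridge(X, y, alpha=1.0):
    """Closed-form ridge regression with an unpenalised intercept."""
    x_mean = X.mean(axis=0)
    y_mean = y.mean()
    Xc = X - x_mean  # centre the features so the intercept is not penalised
    yc = y - y_mean
    n_features = X.shape[1]
    w = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(n_features), Xc.T @ yc)
    b = y_mean - x_mean @ w
    return w, b

def predict(X, w, b):
    """Predict ratings, clipped to [0, 1] (assumed rating scale)."""
    return np.clip(X @ w + b, 0.0, 1.0)

# Toy stand-ins for the three feature families (random values, illustration only).
rng = np.random.default_rng(0)
n_items = 50
contextual = rng.normal(size=(n_items, 8))  # e.g. context-dependent embeddings
static = rng.normal(size=(n_items, 4))      # e.g. context-independent embeddings
norms = rng.normal(size=(n_items, 3))       # e.g. behavioural norms, lexical features
y = rng.uniform(0.0, 1.0, size=n_items)     # toy complexity ratings

X = combine_features(contextual, static, norms)
w, b = fit_ridge(X, y, alpha=1.0)
preds = predict(X, w, b)
```

Concatenation is the simplest way to combine the blocks. In practice the blocks would likely need separate scaling, and the regressor would be evaluated on held-out items rather than on the training data.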
