Alejandro Mosquera at SemEval-2021 Task 1: Exploring Sentence and Word Features for Lexical Complexity Prediction

This paper revisits feature engineering approaches for predicting the complexity level of English words in a given context using regression techniques. Our best submission to the Lexical Complexity Prediction (LCP) shared task ranked 3rd out of 48 systems for sub-task 1 and achieved Pearson correlation coefficients of 0.779 and 0.809 for single words and multi-word expressions, respectively. We conclude that a combination of lexical, contextual and semantic features can still produce strong baselines when compared against human judgement.
