RS_GV at SemEval-2021 Task 1: Sense Relative Lexical Complexity Prediction

This technical report presents RS_GV, our system for SemEval-2021 Task 1 on lexical complexity prediction of English words. RS_GV is a neural network that combines hand-crafted linguistic features with character and word embeddings to predict the complexity of target words. To generate the hand-crafted features, we relate each target word to its senses. RS_GV predicts the complexity of biomedical terms well, but it struggles with very complex and very simple target words.
