JUST-BLUE at SemEval-2021 Task 1: Predicting Lexical Complexity using BERT and RoBERTa Pre-trained Language Models

Predicting the complexity level of a word or a phrase is considered a challenging task, and it is a crucial step in numerous NLP applications such as text rearrangement and text simplification. Early research treated the task as binary classification, where systems predicted whether a word is complex (complex versus non-complex). Other studies assessed the level of word complexity using regression models or multi-label classification models. With the rise of transfer learning and pre-trained language models, deep learning models show a significant improvement over classical machine learning models. This paper presents our approach, which ranked first in SemEval-2021 Task 1 (Sub-task 1): predicting the degree of complexity of a word within a text on a scale from 0 to 1. Using the pre-trained language models BERT and RoBERTa, our system achieved a Pearson correlation score of 0.788.
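For illustration, below is a minimal sketch of fine-tuning a pre-trained encoder for lexical complexity regression with the HuggingFace transformers library. The model name, the sentence-word input pairing, the toy data, and all hyperparameters are illustrative assumptions rather than the authors' exact configuration; the full system uses both BERT and RoBERTa, and how their predictions are combined is not detailed in the abstract.

```python
# Sketch: fine-tuning a pre-trained encoder to predict a lexical
# complexity score in [0, 1].  Illustrative assumptions throughout,
# not the authors' exact setup.
import torch
from scipy.stats import pearsonr
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_NAME = "bert-base-uncased"  # "roberta-base" is handled analogously

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
# num_labels=1 attaches a single-output regression head; with
# problem_type="regression" the fine-tuning loss is MSE against the gold score.
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME, num_labels=1, problem_type="regression"
)

def encode(sentence, target_word):
    # Pair the sentence with the target word so the encoder sees the
    # token both in and out of context: [CLS] sentence [SEP] word [SEP].
    # (This pairing is an assumption, not the paper's documented input format.)
    return tokenizer(sentence, target_word, truncation=True,
                     padding="max_length", max_length=128,
                     return_tensors="pt")

# One illustrative training step on a toy example.
inputs = encode("The cellular phenotype was unexpected.", "phenotype")
labels = torch.tensor([0.45])  # gold complexity score in [0, 1] (toy value)
loss = model(**inputs, labels=labels).loss
loss.backward()  # optimizer step omitted for brevity

# Evaluation uses Pearson correlation between predicted and gold scores,
# the official metric of the shared task.
gold = [0.45, 0.10, 0.80]  # gold dev-set scores (toy values)
pred = [0.50, 0.15, 0.70]  # model predictions (toy values)
r, _ = pearsonr(pred, gold)
print(f"Pearson r = {r:.3f}")
```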
