Predicting Lexical Complexity in English Texts

The first step in most text simplification systems is to predict which words are considered complex by a given target population, before carrying out lexical substitution. This task is commonly referred to as Complex Word Identification (CWI) and is often modelled as a supervised classification problem. Training such systems requires annotated datasets in which words, and sometimes multi-word expressions, are labelled for complexity. In this paper we analyze previous work on this task and investigate the properties of complex word identification datasets for English.
