CAMB at CWI Shared Task 2018: Complex Word Identification with Ensemble-Based Voting

This paper presents the winning systems we submitted to the Complex Word Identification Shared Task 2018. We describe the implementation of our best-performing systems and discuss our key findings. In the monolingual English binary classification track, these systems achieve F1 scores of 0.8792 on the NEWS, 0.8430 on the WIKINEWS, and 0.8115 on the WIKIPEDIA test sets. In the probabilistic track, they achieve mean absolute errors of 0.0558 on the NEWS, 0.0674 on the WIKINEWS, and 0.0739 on the WIKIPEDIA test sets.
