SB@GU at the Complex Word Identification 2018 Shared Task

In this paper, we describe our experiments for the Shared Task on Complex Word Identification (CWI) 2018 (Yimam et al., 2018), hosted by the 13th Workshop on Innovative Use of NLP for Building Educational Applications (BEA) at NAACL 2018. Our system for English builds on previous work on classifying Swedish words into proficiency levels. We investigate different features for English and compare their usefulness using feature selection methods. For the German, Spanish, and French data, we use simple systems based on character n-gram models and show that simple models can sometimes achieve results comparable to those of fully feature-engineered systems.

[1] Michael E. Lesk, et al. Automatic sense disambiguation using machine readable dictionaries: how to tell a pine cone from an ice cream cone, 1986, SIGDOC '86.

[2] A. Jacobs, et al. Optimal viewing position effect in word recognition: A challenge to current theory, 1992.

[3] Wiebke Wagner, et al. Steven Bird, Ewan Klein and Edward Loper: Natural Language Processing with Python, Analyzing Text with the Natural Language Toolkit, 2010, Lang. Resour. Evaluation.

[4] Cédrick Fairon, et al. FLELex: a graded lexical resource for French foreign learners, 2014, LREC.

[5] Lucia Specia, et al. A Report on the Complex Word Identification Shared Task 2018, 2018, BEA@NAACL-HLT.

[6] Gaël Varoquaux, et al. Scikit-learn: Machine Learning in Python, 2011, J. Mach. Learn. Res.

[7] Leo Breiman, et al. Random Forests, 2001, Machine Learning.

[8] Mark A. Hall, et al. Correlation-based Feature Selection for Machine Learning, 2003.

[9] Christiane Fellbaum, et al. Book Reviews: WordNet: An Electronic Lexical Database, 1999, CL.

[10] Walt Detmar Meurers, et al. Assessing the relative reading level of sentence pairs for text simplification, 2014, EACL.

[11] Pierre Geurts, et al. Extremely randomized trees, 2006, Machine Learning.

[12] David Alfter, et al. Towards Single Word Lexical Complexity Prediction, 2018, BEA@NAACL-HLT.

[13] Thomas François, et al. SVALex: a CEFR-graded Lexical Resource for Swedish Foreign and Second Language Learners, 2016, LREC.

[14] R. Flesch. A new readability yardstick, 1948, The Journal of applied psychology.

[15] R. P. Fishburne, et al. Derivation of New Readability Formulas (Automated Readability Index, Fog Count and Flesch Reading Ease Formula) for Navy Enlisted Personnel, 1975.

[16] Thomas François, et al. EFLLex: A Graded Lexical Resource for Learners of English as a Foreign Language, 2018, LREC.

[17] C. K. Ogden, et al. Basic English: a general introduction with rules and grammar, 1930.

[18] Luca Antiga, et al. Automatic differentiation in PyTorch, 2017.

[19] Edgar A. Smith. Devereux Readability Index, 1961.

[20] Irina P. Temnikova, et al. Evaluating the Readability of Text Simplification Output for Readers with Cognitive Disabilities, 2016, LREC.

[21] C. Davis. N-Watch: A program for deriving neighborhood size and other psycholinguistic statistics, 2005, Behavior research methods.

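[22] 吉島 茂, et al. The Common European Framework of Reference for Languages: Learning, teaching, assessment (CEFR) amid cultural and linguistic diversity: is it a standard? (Lecture, 10th Seminar of the Graduate School of Applied Linguistics, Meikai University), 2008.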

[23] อนิรุธ สืบสิงห์, et al. Data Mining: Practical Machine Learning Tools and Techniques, 2014.

[24] Cédrick Fairon, et al. Introducing NT2Lex: A Machine-readable CEFR-graded Lexical Resource for Dutch as a Foreign Language, 2017.