SIMPLEX-PB: A Lexical Simplification Database and Benchmark for Portuguese

Lexical simplification (LS) replaces words or expressions with synonyms that a larger number of people can understand. The task usually has a target audience in mind, such as children or low-literacy readers. Research in this field has been very active in recent years, especially for English, but also for other languages such as Japanese and for multilingual and cross-lingual scenarios. Few works, however, have children as the target audience. In Brazil, the Programa Nacional do Livro Didático (PNLD) is an initiative with a broad impact on education: it aims to choose, acquire, and distribute free textbooks to students in public elementary schools. In this scenario, adapting the complexity of a text to a student's reading ability is a determinant of the student's improvement and of whether he or she reaches the level of reading comprehension expected for that school year. Yet no lexical simplification resources for Portuguese have been publicly available so far, so developing such material is both urgent and welcome. This work compiles SIMPLEX-PB, the first available lexical simplification corpus for Brazilian Portuguese. We also make available a benchmark that evaluates the best-known LS methods on our dataset.
