A Lightweight Regression Method to Infer Psycholinguistic Properties for Brazilian Portuguese

Psycholinguistic properties of words have been used in various approaches to Natural Language Processing tasks, such as text simplification and readability assessment. Most of these properties are subjective, and gathering them requires costly and time-consuming surveys. Recent approaches use the limited datasets of psycholinguistic properties that exist and extend them automatically to large lexicons. However, some of the resources used by such approaches are not available for most languages. This study presents a method to infer psycholinguistic properties for Brazilian Portuguese (BP) using regressors built with a light set of features usually available for less-resourced languages: word length, frequency lists, lexical databases composed of school dictionaries, and word embedding models. The correlations obtained for the inferred properties are close to those reported in related work. The resulting resource contains 26,874 BP words annotated with concreteness, age of acquisition, imageability and subjective frequency.
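As a rough illustration of the idea described above, the following minimal sketch fits a regressor on the four kinds of light features (word length, frequency, presence in a school dictionary, and an embedding vector) and predicts a rating for an unseen word. It is not the authors' implementation: the choice of closed-form ridge regression, the function names (featurize, solve, fit_ridge, predict), the two-dimensional embeddings, and every number in the toy data are assumptions made up for this example.

```python
import math

def featurize(word, freq, in_school_dict, embedding):
    """Light feature vector: bias, length, log frequency, dictionary flag, embedding."""
    flag = 1.0 if in_school_dict else 0.0
    return [1.0, float(len(word)), math.log(1.0 + freq), flag] + list(embedding)

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def fit_ridge(X, y, alpha=1.0):
    """Closed-form ridge: solve (X^T X + alpha * I) w = X^T y, bias left unpenalized."""
    d = len(X[0])
    xtx = [[sum(row[i] * row[j] for row in X) for j in range(d)] for i in range(d)]
    for i in range(1, d):  # index 0 is the bias term
        xtx[i][i] += alpha
    xty = [sum(row[i] * t for row, t in zip(X, y)) for i in range(d)]
    return solve(xtx, xty)

def predict(w, x):
    """Dot product of the weight vector and a feature vector."""
    return sum(wi * xi for wi, xi in zip(w, x))

# Toy data, entirely invented for illustration:
# (word, corpus frequency, in school dictionary, 2-d embedding, concreteness rating)
TOY = [
    ("casa", 900, True, [0.8, 0.1], 6.5),
    ("mesa", 400, True, [0.7, 0.2], 6.7),
    ("cachorro", 300, True, [0.9, 0.0], 6.8),
    ("ideia", 700, True, [0.1, 0.8], 2.4),
    ("liberdade", 350, False, [0.0, 0.9], 1.9),
    ("justiça", 250, False, [0.2, 0.7], 2.1),
]

X = [featurize(w, f, d, e) for w, f, d, e, _ in TOY]
y = [rating for *_, rating in TOY]
weights = fit_ridge(X, y, alpha=1.0)

# Infer a rating for a word that has no human annotation (features also invented).
unseen = featurize("cadeira", 200, True, [0.75, 0.15])
print("predicted concreteness for 'cadeira':", round(predict(weights, unseen), 2))
```

In a realistic setting the embeddings would come from a pretrained model and the regressor would be evaluated against held-out human norms, for example by correlation with the annotated ratings; the sketch only shows how the feature types combine into a single regression input.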
